CEPH: A SCALABLE, HIGH-PERFORMANCE DISTRIBUTED FILE SYSTEM S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, C. Maltzahn, U. C. Santa Cruz, OSDI 2006

Paper highlights Yet another distributed file system using object storage devices Designed for scalability Main contributions 1. Uses hashing to achieve distributed dynamic metadata management 2. Pseudo-random data distribution function replaces object lists

System objectives Excellent performance and reliability Unparalleled scalability thanks to –Distribution of metadata workload inside the metadata cluster –Use of object storage devices (OSDs) Designed for very large systems –Petabyte scale (10^6 gigabytes)

Characteristics of very large systems Built incrementally Node failures are the norm Quality and character of the workload change over time

SYSTEM OVERVIEW System architecture Key ideas Decoupling data and metadata Metadata management Autonomic distributed object storage

System Architecture (I)

System Architecture (II) Clients –Export a near-POSIX file system interface Cluster of OSDs –Store all data and metadata –Communicate directly with clients Metadata server cluster –Manages the namespace (files + directories) –Security, consistency and coherence

Key ideas Separate data and metadata management tasks –Metadata cluster does not have object lists Dynamic partitioning of metadata management tasks inside the metadata cluster –Avoids hot spots Let OSDs handle file migration and replication tasks

Decoupling data and metadata Metadata cluster handles metadata operations Clients interact directly with OSDs for all file I/O Low-level block allocation is delegated to the OSDs Other OSD-based file systems still require the metadata cluster to hold object lists –Ceph instead uses a special pseudo-random data distribution function (CRUSH)

Metadata management Dynamic Subtree Partitioning –Lets Ceph dynamically share metadata workload among tens or hundreds of metadata servers (MDSs) –Sharing is dynamic and based on current access patterns Results in near-linear performance scaling in the number of MDSs

Autonomic distributed object storage Distributed storage handles data migration and data replication tasks Leverages the computational resources of OSDs Achieves reliable, highly available, scalable object storage Reliable implies no data loss Highly available implies being accessible almost all the time

THE CLIENT Performing an I/O Client synchronization Namespace operations

Performing an I/O When a client opens a file –Sends a request to the MDS cluster –Receives an i-node number, the file size, the striping strategy, and a capability Capability specifies the authorized operations on the file (not yet encrypted) –Client uses CRUSH to locate the object replicas –Client releases the capability at close time
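
A minimal sketch of the striping step above: once the MDS has returned the i-node number and the striping strategy, the client can compute by itself which object holds a given byte offset. The object naming scheme and the 4 MiB object size below are illustrative assumptions, not Ceph's exact format.

```python
# Sketch (not Ceph's actual protocol): turn a file offset into an object
# name, given the inode number and a simple fixed-size striping strategy.

def object_for_offset(inode: int, offset: int, object_size: int = 4 * 2**20) -> str:
    """Map a byte offset within a file to the object that stores it,
    assuming the file is striped over fixed-size objects."""
    object_index = offset // object_size
    # Illustrative naming: combine the inode and the object index.
    return f"{inode:x}.{object_index:08x}"

# Example: byte 10 MiB of inode 0x1234 lands in the third 4 MiB object.
print(object_for_offset(0x1234, 10 * 2**20))   # -> "1234.00000002"
```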

Client synchronization (I) POSIX requires –One-copy serializability –Atomicity of writes When the MDS detects conflicting accesses by different clients to the same file –Revokes all caching and buffering permissions –Requires synchronous I/O to that file
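
A minimal sketch of the rule described above, with hypothetical capability names: caching and buffering are allowed only while accesses cannot conflict, and a mix of readers and writers forces synchronous I/O.

```python
# Sketch of the MDS decision, not Ceph's capability protocol.

def capabilities(readers: int, writers: int) -> set[str]:
    if writers == 0:
        return {"READ", "CACHE"}        # read-only: safe to cache
    if writers == 1 and readers == 0:
        return {"WRITE", "BUFFER"}      # lone writer: safe to buffer
    return {"READ", "WRITE", "SYNC"}    # conflicting: synchronous I/O only

print(capabilities(readers=3, writers=0))   # cached reads allowed
print(capabilities(readers=2, writers=1))   # caching revoked, I/O is synchronous
```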

Client synchronization (II) Synchronization handled by the OSDs –Locks can be used for writes spanning object boundaries Synchronous I/O operations have huge latencies Many scientific workloads do a significant amount of read-write sharing –A POSIX extension lets applications synchronize their concurrent accesses to a file themselves

Namespace operations Managed by the MDSs –Read and update operations are all synchronously applied to the metadata Optimized for the common case –readdir returns the contents of the whole directory (as NFS readdirplus does) Guarantees serializability of all operations –Can be relaxed by applications

THE MDS CLUSTER Storing metadata Dynamic subtree partitioning Mapping subdirectories to MDSs

Storing metadata Most requests are likely to be satisfied from the MDS in-memory cache Each MDS logs its update operations in a lazily-flushed journal –Facilitates recovery Directories –Include i-nodes –Stored on the OSD cluster
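
A minimal sketch of the journaling idea: each update is recorded in a lazily-flushed log so a recovering MDS can replay recent operations. The class and method names are illustrative, not the MDS implementation.

```python
# Sketch of a lazily-flushed metadata journal.
import json

class MetadataJournal:
    def __init__(self, path: str):
        self.path = path
        self.pending = []          # updates applied in memory but not yet flushed

    def log_update(self, op: str, dentry: str, inode: dict) -> None:
        """Record an update; it takes effect in the cache immediately but
        is only written out when the journal is flushed."""
        self.pending.append({"op": op, "dentry": dentry, "inode": inode})

    def flush(self) -> None:
        """Lazily flush the accumulated entries; replaying this file after
        a crash rebuilds the recent updates."""
        with open(self.path, "a") as f:
            for entry in self.pending:
                f.write(json.dumps(entry) + "\n")
        self.pending.clear()
```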

Dynamic subtree partitioning Ceph uses a primary-copy approach to cached metadata management Ceph adaptively distributes cached metadata across the MDS nodes –Each MDS measures the popularity of metadata within the directory hierarchy –Ceph migrates and/or replicates hot spots
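
A minimal sketch of a per-directory popularity counter with exponential time decay, the kind of statistic an MDS could use to detect hot subtrees; the half-life value and API are assumptions for illustration.

```python
# Sketch of a popularity counter with exponential time decay.
import time

class DecayCounter:
    def __init__(self, half_life: float = 5.0):
        self.half_life = half_life   # seconds until a count loses half its weight
        self.value = 0.0
        self.last = time.monotonic()

    def _decay(self) -> None:
        now = time.monotonic()
        self.value *= 0.5 ** ((now - self.last) / self.half_life)
        self.last = now

    def hit(self, weight: float = 1.0) -> None:
        """Called on each metadata access to the directory."""
        self._decay()
        self.value += weight

    def get(self) -> float:
        self._decay()
        return self.value

# Directories whose counters exceed a threshold become candidates for
# migration to another MDS (or replication, if many clients read them).
```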

Mapping subdirectories to MDSs

DISTRIBUTED OBJECT STORAGE Data distribution with CRUSH Replication Data safety Recovery and cluster updates EBOFS

Data distribution with CRUSH (I) Wanted to avoid storing object addresses in the MDS cluster Ceph first maps objects into placement groups (PGs) using a hash function Placement groups are then assigned to OSDs using a pseudo-random function (CRUSH) –Clients know that function

Data distribution with CRUSH (II) To access an object, a client needs to know –Its placement group –The OSD cluster map –The object placement rules used by CRUSH Replication level Placement constraints
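
A minimal sketch of the two-step placement: hash the object into a placement group, then map the placement group to a set of OSDs with a deterministic pseudo-random choice. This stands in for CRUSH, which additionally walks the cluster map hierarchy and applies placement rules such as spreading replicas across racks.

```python
# Sketch of object -> PG -> OSDs placement; not the real CRUSH algorithm.
import hashlib, random

def object_to_pg(object_name: str, pg_count: int) -> int:
    h = int(hashlib.sha1(object_name.encode()).hexdigest(), 16)
    return h % pg_count

def pg_to_osds(pg: int, osds: list[int], replicas: int = 3) -> list[int]:
    # Deterministic: any client with the same cluster map computes the
    # same answer, so no central lookup table is needed.
    rng = random.Random(pg)
    return rng.sample(osds, replicas)

cluster = list(range(24))                     # e.g. 24 OSDs, as in the evaluation
pg = object_to_pg("1234.00000002", 2048)
print(pg, pg_to_osds(pg, cluster))            # same output on every client
```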

How files are striped

Replication Ceph's Reliable Autonomic Distributed Object Store (RADOS) autonomously manages object replication The first non-failed OSD in the object's replication list acts as the primary –Applies each update locally –Increments the object's version number –Propagates the update to the other replicas

Data safety Achieved by the update process 1. Primary forwards updates to the other replicas 2. Sends an ACK to the client once all replicas have received the update 3. Replicas send a final commit once they have committed the update to disk (slower but safer)
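
A minimal sketch of the two-phase acknowledgement described above: the primary acks the write once every replica has the update in memory, and sends a final commit only after every replica reports the update is on disk. The classes are illustrative, not RADOS code.

```python
# Sketch of the ack/commit sequence between a primary and its replicas.

class Replica:
    def __init__(self, name):
        self.name = name
        self.buffer = {}                 # in-memory copy, not yet durable
        self.disk = {}                   # durable copy

    def receive(self, oid, data) -> bool:
        self.buffer[oid] = data
        return True                      # "I have received the update"

    def commit_to_disk(self, oid) -> bool:
        self.disk[oid] = self.buffer[oid]
        return True                      # "the update is now durable"

def primary_write(replicas, oid, data):
    # Phase 1: forward to all replicas; ack the client when all have it.
    assert all(r.receive(oid, data) for r in replicas)
    print("ack -> client (update visible, not yet durable)")
    # Phase 2: wait for durability, then send the final commit so the
    # client may drop its own buffered copy.
    assert all(r.commit_to_disk(oid) for r in replicas)
    print("commit -> client (update safely on disk everywhere)")

primary_write([Replica("osd1"), Replica("osd2"), Replica("osd3")],
              "1234.00000002", b"hello")
```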

Committing writes

Recovery and cluster updates RADOS (Reliable Autonomic Distributed Object Store) monitors OSDs to detect failures Recovery is handled by the same mechanism as deployment of new storage –Entirely driven by individual OSDs

Low-level storage management Most distributed file systems use an existing local file system to manage low-level storage –Hard to know when object updates are safely committed to disk –Could use journaling or synchronous writes, at a big performance penalty

Low-level storage management Each Ceph OSD manages its local object storage with EBOFS (Extent and B-tree based Object File System) –B-tree service locates objects on disk –Block allocation is conducted in terms of extents to keep data compact –Well-defined update semantics
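
A minimal sketch of extent-based allocation, the idea behind EBOFS keeping data compact: an object's data is described by a few (start, length) extents instead of many scattered blocks. The free-list handling here is deliberately simplistic and purely illustrative.

```python
# Sketch of extent-based allocation from a free list of contiguous runs.

def allocate_extent(free_extents, length):
    """Return a (start, length) extent and the updated free list,
    preferring a single contiguous run over many small blocks."""
    for i, (start, free_len) in enumerate(free_extents):
        if free_len >= length:
            remaining = free_extents[:i] + free_extents[i + 1:]
            if free_len > length:
                # Keep the unused tail of the run on the free list.
                remaining.insert(i, (start + length, free_len - length))
            return (start, length), remaining
    raise RuntimeError("no contiguous extent large enough")

free = [(0, 128), (512, 1024)]             # free runs of blocks
extent, free = allocate_extent(free, 200)  # allocated in one contiguous piece
print(extent, free)                        # (512, 200) [(0, 128), (712, 824)]
```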

PERFORMANCE AND SCALABILITY Want to measure –Cost of updating replicated data (throughput and latency) –Overall system performance –Scalability –Impact of MDS cluster size on latency

Impact of replication (I)

Impact of replication (II) Transmission times dominate for large synchronized writes

File system performance

Scalability Switch is saturated at 24 OSDs

Impact of MDS cluster size on latency

Conclusion Ceph addresses three critical challenges of modern distributed file systems –Scalability –Performance –Reliability Achieved through reducing the workload of the MDS cluster –CRUSH –Autonomous repairs by the OSDs