Awesome distributed storage system

1 Awesome distributed storage system
Philippe Raipin

2 Ceph-History At the beginning (2006): part of Sage Weil's Ph.D. research at the University of California, Santa Cruz. After graduation (2007): open-sourced, with 3 main developers. In 2011, S. Weil created Inktank Storage to provide professional services and support for Ceph (~60 developers). In April 2014, Red Hat acquired Inktank ($175 million).

3 Open Source Project www.ceph.com www.github.com/ceph
9th release (Infernalis, 11/2015)

4 Ceph-Target Ceph is a unified, distributed storage system designed for excellent performance, reliability and scalability. Designed for commodity hardware; software-based; open source; self-managing/self-healing; self-balancing; painless scaling; no SPOF. Object storage (S3, Swift), block storage, file system (POSIX). Ceph's main goals are to be completely distributed without a single point of failure, scalable to the exabyte level, and freely available. The data is replicated, making it fault tolerant.

5 Architecture outline

6 RBD RADOS Block Device. Thin provisioning, snapshot/clone.
Can be used as an OpenStack Cinder backend or by libvirt. SAN replacement.
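For illustration, a minimal RBD workflow with the rbd command-line tool (the pool and image names are hypothetical):
  rbd create mypool/myimage --size 10240        # thin-provisioned 10 GB image (size in MB)
  rbd snap create mypool/myimage@snap1          # point-in-time snapshot
  rbd snap protect mypool/myimage@snap1         # required before cloning
  rbd clone mypool/myimage@snap1 mypool/myclone # copy-on-write clone
  rbd map mypool/myimage                        # expose the image as a /dev/rbd* block device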

7 CephFS A POSIX-compliant (legacy-style) network file system.
Plugin for Hadoop (HDFS alternative). Kernel client (mainline since 2010) or FUSE. NFS/CIFS replacement.
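For illustration, mounting CephFS either with the kernel client or with ceph-fuse (the monitor address and secret file path are hypothetical):
  mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
  ceph-fuse -m 192.168.0.1:6789 /mnt/cephfs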

8 RGW RADOS Gateway: HTTP REST gateway for the RADOS object store.
AWS S3 compliant, OpenStack Swift compliant. S3/Swift replacement.
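For illustration, creating an RGW user (S3 credentials) and a Swift subuser with radosgw-admin; the uid and display name are hypothetical:
  radosgw-admin user create --uid=johndoe --display-name="John Doe"
  radosgw-admin subuser create --uid=johndoe --subuser=johndoe:swift --access=full
The generated access/secret keys can then be used by any standard S3 or Swift client pointed at the gateway.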

9 What services can easily be built on a Ceph cluster?
A Dropbox-like service (Ceph RGW + OwnCloud), a volume provider (Ceph RBD), an NFS-like service (CephFS), all with the same Ceph cluster.

10 a Ceph cluster

11 Ceph-Concept Object Servers (OS): store the objects.
Monitor Servers (MON): watch over the storage network, maintain the group membership and ensure consistency (strong consistency). Metadata Servers (MDS): store the file system structure. A service uses a Pool, which is composed of Placement Groups; a placement group is a storage space distributed over n object servers. CRUSH map: defines the placement rules. Replication vs. erasure code. Cache tiering.
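For illustration, creating a replicated pool and inspecting its placement groups; the pool name and PG count are hypothetical:
  ceph osd pool create mypool 128     # replicated pool with 128 placement groups
  ceph osd pool set mypool size 3     # keep 3 copies of every object
  ceph osd pool get mypool pg_num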

12 Entities [Diagram: a Ceph client on the client host issues CRUD operations against a pool made of placement groups PG0, PG1, PG2, …, PGn; each placement group is mapped onto several OSDs (e.g. PG0 → OSD0a, OSD0b, OSD0c) located on the OSD hosts. An OSD can "belong" to several PGs.]

13 Pool Type Resiliency: Replicated or Erasure Code.
Replicated: each PG is composed of n OSDs; one OSD is designated as Primary; I/O is done on the Primary; each object is copied to the other OSDs by the Primary (strong consistency: ack after copy). Erasure Code: a pool can have an erasure code profile (parameters k, m); each PG is composed of k+m OSDs; I/O is done on the Primary (which encodes and decodes); each object is encoded into k+m chunks by the Primary and then spread over the k+m OSDs (strong consistency: ack after creation). The default erasure code library is jerasure; other libraries can be loaded dynamically (plugins).
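A minimal sketch of creating an erasure-coded pool from a profile; the profile name, pool name and k/m values are hypothetical:
  ceph osd erasure-code-profile set myprofile k=4 m=2
  ceph osd erasure-code-profile get myprofile
  ceph osd pool create ecpool 128 128 erasure myprofile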

14 Erasure code overview

15 Object Store Device Stack: OSD daemon on top of a file system (xattrs) on top of a disk.
Atomic transactions: put, get, delete, … The backing file system (btrfs, xfs, ext4) stores object data and attributes (xattrs). An OSD is primary for some objects: responsible for resiliency, coherency, re-balancing and recovery. An OSD is secondary for other objects: under the control of the primary and capable of becoming primary.
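For illustration, the put/get/delete primitives exercised with the rados tool; pool, object and file names are hypothetical:
  rados -p mypool put myobject ./localfile   # store an object
  rados -p mypool get myobject ./copy        # read it back
  rados -p mypool stat myobject              # size and mtime
  rados -p mypool rm myobject                # delete it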

16 Object Placement

17 CRUSH Controlled Replication Under Scalable Hashing
A pseudo-random, deterministic data distribution algorithm that efficiently and robustly distributes object replicas across a heterogeneous, structured storage cluster. Based on OSD weights and placement rules. This avoids the need for an index server to coordinate reads and writes. S. A. Weil, S. A. Brandt, E. L. Miller, and C. Maltzahn. CRUSH: Controlled, scalable, decentralized placement of replicated data. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC '06), Tampa, FL, Nov. 2006. ACM.
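Because the placement is computed rather than looked up, any client or admin node can ask where an object lands; a sketch with hypothetical pool and object names:
  ceph osd map mypool myobject    # prints the PG and the up/acting OSD set for that object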

18 OSDs, Buckets, Rules

devices:
device 0 osd.0  device 1 osd.1  device 2 osd.2  device 3 osd.3
device 4 osd.4  device 5 osd.5  device 6 osd.6  device 7 osd.7

buckets (abbreviated on the slide):
host ceph-osd-ssd-server-1 { id -1  alg straw  hash 0  item osd.0 weight 1.00  item osd.1 weight 1.00 }
host ceph-osd-ssd-server-2 { id -2  item osd.2 weight 1.00  item osd.3 weight 1.00 }
host ceph-osd-platter-server-1 { id -3  item osd.4 weight 1.00  item osd.5 weight 1.00 }
host ceph-osd-platter-server-2 { id -4  item osd.6 weight 1.00  item osd.7 weight 1.00 }
root platter { id -5  item ceph-osd-platter-server-1 weight 2.00  item ceph-osd-platter-server-2 weight 2.00 }
root ssd { id -6  item ceph-osd-ssd-server-1 weight 2.00  item ceph-osd-ssd-server-2 weight 2.00 }

rules (abbreviated on the slide):
rule data { ruleset 0  type replicated  min_size 2  max_size 2  step take platter  step chooseleaf firstn 0 type host  step emit }
rule metadata { ruleset 1  min_size 0  max_size 10 }
rule rbd { ruleset 2 }
rule platter { ruleset 3 }
rule ssd { ruleset 4  max_size 4  step take ssd }

Straw buckets: list and tree buckets use a divide-and-conquer strategy in a way that either gives certain items precedence (e.g., those at the beginning of a list) or obviates the need to consider entire subtrees of items at all. That improves the performance of the replica placement process, but can also introduce suboptimal reorganization behavior when the contents of a bucket change due to the addition, removal, or re-weighting of an item. The straw bucket type allows all items to fairly "compete" against each other for replica placement through a process analogous to a draw of straws.
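To modify a CRUSH map like the one above, the usual workflow is to extract it, decompile it to text, edit, recompile and inject it back; a sketch with arbitrary file names:
  ceph osd getcrushmap -o crush.bin     # extract the compiled map from the cluster
  crushtool -d crush.bin -o crush.txt   # decompile to editable text
  crushtool -c crush.txt -o crush.new   # recompile after editing buckets/rules
  ceph osd setcrushmap -i crush.new     # inject the new map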

19 Cache Tiering Two modes: write-back and read-only.
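For illustration, putting a fast pool in front of a slower base pool in write-back mode; the pool names are hypothetical:
  ceph osd tier add coldpool hotpool           # attach hotpool as a cache tier of coldpool
  ceph osd tier cache-mode hotpool writeback   # or: readonly
  ceph osd tier set-overlay coldpool hotpool   # route client traffic through the cache tier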

20 Monitor Maintains the cluster state and its history.
Mon map, OSD map, PG map, CRUSH map, MDS map. Every Ceph client has a list of monitor addresses.
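Each of these maps can be inspected from an admin node; for example, with the Infernalis-era CLI:
  ceph mon dump        # monitor map
  ceph osd dump        # OSD map
  ceph pg dump         # placement group map
  ceph osd crush dump  # CRUSH map (JSON)
  ceph mds dump        # MDS map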

21 Dependability Monitors use a consensus algorithm to maintain the maps
All the mons have the same view of the maps (strong consistency). The algorithm is quorum-based: a majority of monitors must be alive, so if half or more of the monitors crash or disappear, the cluster loses quorum and the system won't be available (e.g., 3 monitors tolerate 1 failure, 5 tolerate 2). L. Lamport, "Paxos Made Simple", ACM SIGACT News, vol. 32, no. 4, pp. 18–25, 2001. Paxos considers a collection of processes that can propose values; a consensus algorithm ensures that a single one among the proposed values is chosen.

22 Metadata Server (MDS) Stores the metadata of CephFS (permission bits, ACLs, ownership, …). The metadata are stored in a Ceph pool (not locally). Caches the metadata. Provides high availability of metadata (multiple MDSs).
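A sketch of creating a CephFS file system once an MDS daemon is running; the pool names and PG counts are hypothetical:
  ceph osd pool create cephfs_data 64
  ceph osd pool create cephfs_metadata 64
  ceph fs new cephfs cephfs_metadata cephfs_data   # metadata pool first, then data pool
  ceph mds stat                                    # verify an MDS has become active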

23 MDS An adaptive metadata cluster architecture based on Dynamic Subtree Partitioning, which adaptively and intelligently distributes responsibility for managing the file system directory hierarchy among the available MDSs in the MDS cluster. Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, Carlos Maltzahn: Ceph: A Scalable, High-Performance Distributed File System. OSDI 2006.

24 Ceph-Status Portal: www.ceph.com Code at https://github.com/ceph/ceph
Versions: Infernalis (11/2015), Hammer (04/2015), Giant (10/2014), Firefly (05/2014), Emperor (11/2013), Dumpling (08/2013), Cuttlefish (05/2013), Bobtail (01/2013), Argonaut (07/2012). License: LGPL v2.1, BSD, MIT, Apache 2, … On March 19, 2010, Linus Torvalds merged the Ceph client into the mainline Linux kernel. Active contributors: ~120; very active community. Clients mount the POSIX-compatible file system using a Linux kernel client; an older FUSE-based client is also available. The servers run as regular Unix daemons. The Ceph community is independent of Inktank.

25 inkScope is a Ceph visualization and admin interface
Open source: version 1.3 (23/12/2015)

26 Architecture

27

28

29

