Ceph: A Scalable, High-Performance Distributed File System Derek Weitzel.


1 Ceph: A Scalable, High-Performance Distributed File System Derek Weitzel

2 In the Before…  Let’s go back through some of the notable distributed file systems used in HPC

3 In the Before…  There were distributed filesystems like:  Lustre – RAID over storage boxes  Recovery time after a node failure was MASSIVE! (Entire server’s contents had to be copied, one to one)  When functional, reading/writing EXTREMELY fast  Used heavily in HPC

4 In the Before…  There were distributed filesystems like:  NFS – Network File System  Does this really count as distributed?  Single large server  Full POSIX support, in kernel since…forever  Slow with even a moderate number of clients  Dead simple

5 In the Current…  There are distributed filesystems like:  Hadoop – Apache Project inspired by Google  Massive throughput  Throughput scales with attached HDs  Have seen VERY LARGE production clusters  Facebook, Yahoo… Nebraska  Doesn’t even pretend to be POSIX

6 In the Current…  There are distributed filesystems like:  GPFS (IBM) / Panasas – Proprietary file systems  Requires closed source kernel driver  Not flexible with newest kernels / OS’s  Good: Good support and large communities  Can be treated as black box for administrators  HUGE installations (Panasas at LANL is HUGE!!!!)

7 Motivation  Ceph is an emerging technology in the production clustered environment  Designed for:  Performance – Striped data over data servers.  Reliability – No single point of failure  Scalability – Adaptable metadata cluster

8 Timeline  2006 – Ceph Paper written  2007 – Sage Weil earned PhD from Ceph (largely)  2007 – 2010 Development continued, primarily for DreamHost  March 2010 – Linus merged Ceph client into mainline kernel  No more patches needed for clients

9 Adding Ceph to Mainline Kernel  Huge development!  Significantly lowered cost to deploy Ceph  For production environments, it was a little too late – was the stable kernel used in RHEL 6 (CentOS 6, SL 6, Oracle 6).

10 Let’s talk about the paper  Then I’ll show a quick demo

11 Ceph Overview  Decoupled data and metadata  IO directly with object servers  Dynamic distributed metadata management  Multiple metadata servers handling different directories (subtrees)  Reliable autonomic distributed storage  OSD’s manage themselves by replicating and monitoring

12 Decoupled Data and Metadata  Increases performance by limiting interaction between clients and servers  Decoupling is common in distributed filesystems: Hadoop, Lustre, Panasas…  In contrast to other filesystems, Ceph uses a function to calculate the block locations

13 Dynamic Distributed Metadata Management  Metadata is split among a cluster of servers  Distribution of metadata changes with the number of requests, to even out the load among metadata servers  Metadata servers can also quickly recover from failures by taking over their neighbors’ data  Improves performance by leveling metadata load

14 Reliable Autonomic Distributed Storage  Data storage servers act on events by themselves  Initiates replication and  Improves performance by offloading decision making to the many data servers  Improves reliability by removing central control of the cluster (single point of failure)

15 Ceph Components  Some quick definitions before getting into the paper  MDS – Metadata Server  OSD – Object Storage Device  MON – Monitor (Now fully implemented)

16 Ceph Components  Ordered: Clients, Metadata, Object Storage 1 2 3

17 Ceph Components  Ordered: Clients, Metadata, Object Storage 1 2 3

18 Client Overview  Can be a FUSE mount  Filesystem in Userspace  Introduced so file systems can use a better interface than the Linux Kernel VFS (Virtual file system)  Can link directly to the Ceph Library  Built into newest OS’s.

19 Client Overview – File IO  1. Asks the MDS for the inode information

20 Client Overview – File IO  2. Responds with the inode information

21 Client Overview – File IO  3. Client Calculates data location with CRUSH

22 Client Overview – File IO  4. Client reads directly off storage nodes

23 Client Overview – File IO  Client asks MDS for a small amount of information  Performance: Small bandwidth between client and MDS  Performance: Small cache (memory) due to small data  Client calculates file location using function  Reliability: Saves the MDS from keeping block locations  Function described in data storage section
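
The four steps above can be sketched in a few lines of Python. This is only an illustration, not Ceph client code: `FakeMDS`, `locate`, `read_plan`, the example path, and the 4 MiB object size are all made up for the sketch, and `locate` is a plain hash standing in for CRUSH.

```python
import hashlib

class FakeMDS:
    """Hypothetical metadata server: holds small inode records, no block lists."""
    def __init__(self):
        self.inodes = {"/data/run1.dat": {"ino": 4242, "size": 8 * 2**20}}

    def lookup(self, path):
        # Steps 1-2: the client asks, the MDS answers with inode information.
        return self.inodes[path]

def locate(ino, object_no, num_osds):
    """Step 3: placeholder for CRUSH; the client computes this locally."""
    digest = hashlib.sha256(f"{ino}.{object_no}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_osds

def read_plan(mds, path, num_osds, object_size=4 * 2**20):
    """Return (object name, OSD id) pairs the client would read from."""
    inode = mds.lookup(path)
    num_objects = -(-inode["size"] // object_size)  # ceiling division
    plan = []
    for n in range(num_objects):
        # Step 4: the client would read this object directly from that OSD.
        plan.append((f"{inode['ino']}.{n}", locate(inode["ino"], n, num_osds)))
    return plan

plan = read_plan(FakeMDS(), "/data/run1.dat", num_osds=10)
```

Note that the MDS reply stays the same size no matter how large the file is; only the client-side loop grows.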

24 Ceph Components  Ordered: Clients, Metadata, Object Storage 1 2 3

25 Client Overview – Namespace  Optimized for the common case, ‘ls -l’  Directory listing immediately followed by a stat of each file  Reading directory gives all inodes in the directory  Namespace covered in detail next!
$ ls -l
total 0
drwxr-xr-x 4 dweitzel swanson 63 Aug apache
drwxr-xr-x 5 dweitzel swanson 42 Jan 18 11:15 argus-pep-api-java
drwxr-xr-x 5 dweitzel swanson 42 Jan 18 11:15 argus-pep-common
drwxr-xr-x 7 dweitzel swanson 103 Jan 18 16:37 bestman2
drwxr-xr-x 6 dweitzel swanson 75 Jan 18 12:25 buildsys-macros

26 Metadata Overview  Metadata servers (MDS) serve out the file system attributes and directory structure  Metadata is stored in the distributed filesystem beside the data  Compare this to Hadoop, where metadata is stored only on the head nodes  Updates are staged in a journal, flushed occasionally to the distributed file system

27 MDS Subtree Partitioning  In HPC applications, it is common to have ‘hot’ metadata that is needed by many clients  In order to be scalable, Ceph needs to distribute metadata requests among many servers  MDS will monitor frequency of queries using special counters  MDS will compare the counters with each other and split the directory tree to evenly split the load

28 MDS Subtree Partitioning  Multiple MDS split the metadata  Clients will receive metadata partition data from the MDS during a request

29 MDS Subtree Partitioning  Busy directories (multiple creates or opens) will be hashed across multiple MDS’s

30 MDS Subtree Partitioning  Clients will read from random replica  Update to the primary MDS for the subtree
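
A rough sketch of the counter-driven balancing idea, under heavy simplification: the real MDS logic is more involved, and `loads`, `rebalance`, the directory names, and the request counts below are all invented for illustration. Each subtree carries a request counter, and subtrees move from the busiest MDS to the idlest one as long as a move narrows the gap.

```python
from collections import Counter

def loads(assignment, hits, mds_ids):
    """Sum the request counters of the subtrees each MDS currently owns."""
    load = {m: 0 for m in mds_ids}
    for subtree, m in assignment.items():
        load[m] += hits[subtree]
    return load

def rebalance(assignment, hits, mds_ids):
    """Greedy sketch: hand subtrees from the busiest MDS to the idlest one."""
    assignment = dict(assignment)  # leave the caller's mapping untouched
    while True:
        load = loads(assignment, hits, mds_ids)
        busiest = max(mds_ids, key=load.get)
        idlest = min(mds_ids, key=load.get)
        gap = load[busiest] - load[idlest]
        # Only a subtree lighter than the gap makes the pair more even.
        movable = [s for s, m in assignment.items()
                   if m == busiest and 0 < hits[s] < gap]
        if not movable:
            return assignment
        assignment[max(movable, key=hits.get)] = idlest

hits = Counter({"/home": 900, "/scratch": 700, "/proj": 300, "/tmp": 100})
start = {subtree: 0 for subtree in hits}  # everything begins on MDS 0
balanced = rebalance(start, hits, mds_ids=[0, 1])
```

The greedy rule is a design shortcut for the sketch; it always terminates because every move strictly shrinks the imbalance between the two servers involved.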

31 Ceph Components  Ordered: Clients, Metadata, Object Storage 1 2 3

32 Data Placement  Need a way to evenly distribute data among storage devices (OSD)  Increased performance from even data distribution  Increased resiliency: Losing any node minimally affects the status of the cluster if data is evenly distributed  Problem: Don’t want to keep data locations in the metadata servers  Requires lots of memory if lots of data blocks

33 CRUSH  CRUSH is a pseudo-random function to find the location of data in a distributed filesystem  Summary: Take a little information, plug into globally known function (hashing?) to find where the data is stored  Input data is:  inode number – From MDS  OSD Cluster Map (CRUSH map) – From OSD/Monitors

34 CRUSH  CRUSH maps a file to a list of servers that have the data

35 CRUSH  File to Object: Takes the inode (from MDS)

36 CRUSH  File to Placement Group (PG): Object ID and number of PG’s

37 Placement Group  Sets of OSDs that manage a subset of the objects  OSD’s will have many Placement Groups  Placement Groups will have R OSD’s, where R is number of replicas  An OSD will either be a Primary or Replica  Primary is in charge of accepting modification requests for the Placement Group  Clients will write to Primary, read from random member of Placement Group

38 CRUSH  PG to OSD: PG ID and Cluster Map (from OSD)
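
The three mapping stages (file to object, object to PG, PG to OSDs) can be sketched as below. Big caveat: this is NOT the real CRUSH algorithm. The last stage uses rendezvous (highest-random-weight) hashing as a simple stand-in that shares the key property that any client can compute placement from the cluster map alone. The object naming scheme, the helper names, and the flat `cluster_map` dictionary are assumptions made for the sketch.

```python
import hashlib

def _h(*parts):
    """Stable 64-bit hash of the given parts."""
    data = "/".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def object_id(ino, stripe_no):
    # File to object: inode number (from the MDS) plus a stripe index.
    return f"{ino:x}.{stripe_no:08x}"

def placement_group(oid, num_pgs):
    # Object to PG: hash the object ID, reduce by the number of PGs.
    return _h(oid) % num_pgs

def osds_for_pg(pg_id, cluster_map, replicas):
    # PG to OSDs: rank every "up" OSD by a per-(PG, OSD) hash, keep the top R.
    # Rendezvous hashing, used here only as a stand-in for CRUSH.
    up = [osd for osd, state in cluster_map.items() if state == "up"]
    ranked = sorted(up, key=lambda osd: _h(pg_id, osd), reverse=True)
    return ranked[:replicas]  # first entry plays the role of the primary

cluster_map = {f"osd.{i}": "up" for i in range(6)}
oid = object_id(0x1234, 0)
pg = placement_group(oid, 128)
osds = osds_for_pg(pg, cluster_map, 3)
```

Nothing here is looked up in a table: given the same inode number and cluster map, every client arrives at the same OSD list.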

39 CRUSH  Now we know where to write the data / read the data  Now how do we safely handle replication and node failures?

40 Replication  Replicates to nodes also in the Placement Group

41 Replication  Write to the placement group primary (from CRUSH function).

42 Replication  Primary OSD replicates to other OSD’s in the Placement Group

43 Replication  Commit the update only after the slowest replica has finished its update
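
A toy model of the replication steps above, assuming every member of the placement group applies the update in parallel and ignoring network hops. The `OSD` class, `replicated_write`, and the latency figures are invented for the sketch.

```python
class OSD:
    """Hypothetical storage node with a made-up, fixed apply time."""
    def __init__(self, name, latency_ms):
        self.name = name
        self.latency_ms = latency_ms
        self.store = {}

    def apply(self, oid, data):
        self.store[oid] = data
        return self.latency_ms

def replicated_write(pg, oid, data):
    """Write via the primary (pg[0]); return the latency the client sees.

    The primary applies the update and forwards it to the replicas. The
    client is acknowledged only once every copy is done, so the wait is
    set by the slowest member of the placement group.
    """
    return max(osd.apply(oid, data) for osd in pg)

pg = [OSD("osd.0", 5), OSD("osd.1", 9), OSD("osd.2", 7)]
latency = replicated_write(pg, "1234.00000000", b"payload")
```

Under this model, adding replicas can only keep the client-visible latency the same or make it worse.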

44 Failure Detection  Each Autonomic OSD looks after nodes in its Placement Group (possibly many!).  Monitors keep a cluster map (used in CRUSH)  Multiple monitors keep an eye on cluster configuration, dole out cluster maps.

45 Recovery & Updates  Recovery is entirely between OSDs  OSDs have two off modes, Down and Out.  Down is when the node could come back, Primary for a PG is handed off  Out is when a node will not come back, data is re-replicated.

46 Recovery & Updates  Each object has a version number  Upon bringing up, check version number of Placement Groups to see if current  Check version number of objects to see if need update
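
The version check above might look roughly like this. The dictionary layout and the function names are assumptions for the sketch, not Ceph data structures: a returning OSD first compares the PG version with a peer, and only scans individual objects if the PG is behind.

```python
def objects_to_recover(local_versions, peer_versions):
    """Return object IDs whose local copy is missing or older than the peer's."""
    stale = []
    for oid, peer_v in peer_versions.items():
        if local_versions.get(oid, -1) < peer_v:
            stale.append(oid)
    return sorted(stale)

def recover_pg(local, peer):
    """local / peer: {'pg_version': int, 'objects': {oid: version}}."""
    if local["pg_version"] >= peer["pg_version"]:
        return []  # PG is current, skip the per-object scan
    return objects_to_recover(local["objects"], peer["objects"])

returning = {"pg_version": 7, "objects": {"a": 3, "b": 5}}
peer = {"pg_version": 9, "objects": {"a": 3, "b": 6, "c": 1}}
needed = recover_pg(returning, peer)
```

In this example only the changed object and the object created while the node was away need to be copied; the unchanged one is left alone.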

47 Ceph Components  Ordered: Clients, Metadata, Object Storage (Physical) 1 2 4

48 Object Storage  The underlying filesystem can make or break a distributed one  Filesystems have different characteristics  Example: ReiserFS good at small files  XFS good at REALLY big files  Ceph keeps a lot of attributes on the inodes, needs a filesystem that can handle attrs.

49 Object Storage  Ceph can run on normal file systems, but slow  XFS, ext3/4, …  Created own Filesystem in order to handle special object requirements of Ceph  EBOFS – Extent and B-Tree based Object File System.

50 Object Storage  Important to note that development of EBOFS has ceased  Though Ceph can run on any normal filesystem (I have it running on ext4)  Highly recommended to run on BTRFS

51 Object Storage - BTRFS  Fast Writes: Copy on write file system for Linux  Great Performance: Supports small files with fast lookup using B-Tree algorithm  Ceph Requirement: Supports unlimited chaining of attributes  Integrated into mainline kernel  Considered next generation file system  Peer of ZFS from Sun  Child of ext3/4

52 Performance and Scalability  Let’s look at some graphs!

53 Performance & Scalability  Write latency with different replication factors  Remember, Ceph has to write to all replicas before ACKing the write to the client

54 Performance & Scalability  X-Axis is size of the write to Ceph  Y-Axis is the Latency when writing X KB

55 Performance & Scalability  Notice, this is still small writes, < 1MB  As you can see, the more replicas Ceph has to write, the slower the ACK to the client

56 Performance & Scalability  Obviously, async write is faster  Latency for async is from flushing buffers to Ceph

57 Performance and Scalability  2 lines for each file system  Writes are bunched at top, reads at bottom

58 Performance and Scalability  X-Axis is the KBs written to or read from  Y-Axis is the throughput per OSD (node)

59 Performance and Scalability  The custom ebofs does much better on both writes and reads

60 Performance and Scalability  Writes for ebofs max the throughput of the underlying HD

61 Performance and Scalability  X-Axis is size of the cluster  Y-Axis is the per OSD throughput

62 Performance and Scalability  Most configurations hover around HD speed

63 Performance and Scalability  32k PGs will distribute data more evenly over the cluster than the 4k PGs

64 Performance and Scalability  Evenly splitting the data will lead to a balanced load across the OSDs
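
A small simulation hints at why more placement groups balance better. It assumes PGs land on OSDs uniformly at random, that each PG holds the same amount of data, and a 64-OSD cluster; none of these figures come from the paper's test setup, and `spread` is a made-up helper.

```python
import random
from statistics import mean, pstdev

def spread(num_pgs, num_osds, seed=0):
    """Relative spread (stddev / mean) of PGs per OSD under random placement."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    counts = [0] * num_osds
    for _ in range(num_pgs):
        counts[rng.randrange(num_osds)] += 1
    return pstdev(counts) / mean(counts)

spread_4k = spread(4096, 64)
spread_32k = spread(32768, 64)
```

For purely random placement the relative spread shrinks roughly like sqrt(num_osds / num_pgs), so eight times as many PGs should cut the imbalance by a bit less than a factor of three.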

65 Conclusions  Very fast POSIX compliant file system  General enough for many applications  No single point of failure – Important for large data centers  Can handle HPC like applications (lots of metadata, small files)

66 Demonstration  Started 3 Fedora 16 instances on HCC’s private cloud

67 Demonstration  Some quick things if the demo doesn’t work  MDS log of a MDS handing off a directory to another for load balancing :15: f964654b700 mds.0.migrator nicely exporting to mds.1 [dir /hadoop-grid/ [2,head] auth{1=1} pv=2574 v=2572 cv=0/0 ap=1+2+3 state= |complete f(v2 m :14: =0+1) n(v86 rc :15: b =213+79) hs=1+8,ss=0+0 dirty=9 | child replicated dirty authpin 0x29a0fe0]

68 Demonstration  Election after a Monitor was overloaded  Lost another election (peon  ): :23: fcf log [INF] : mon.gamma calling new monitor election :23: fcf log [INF] : mon.gamma calling new monitor election :23: fcf log [INF] : won leader election with quorum 1, :15: f50b360e700 e26 e26: 3 osds: 2 up, 3 in

69 GUI Interface

70 Where to Find More Info  New company sponsoring development   Instruction on setting up CEPH can be found on the Ceph wiki:   Or my blog 
