
1 6/10/2003 Cluster File Systems, Inc. Peter J. Braam, Tim Reddin braam@clusterfs.com tim.reddin@hp.com http://www.clusterfs.com The Lustre Storage Architecture. Linux Clusters for Super Computing, Linköping 2003

2 2 - NSC 2003 Topics: history of the project; high-level picture; networking; devices and fundamental APIs; file I/O; metadata & recovery; project status. Cluster File Systems, Inc.

3 3 - NSC 2003 Lustre’s History

4 4 - NSC 2003 Project history: 1999, CMU & Seagate. Worked with Seagate for one year on storage management and clustering; built prototypes and did much design work, much of which survives today.

5 5 - NSC 2003 2000-2002: the file system challenge. First put forward Sep 1999 in Santa Fe: a new architecture for the National Labs. Characteristics: 100's of GB/sec of I/O throughput, trillions of files, 10,000's of nodes, petabytes of storage. From the start, Garth & Peter were in the running.

6 6 - NSC 2003 2002-2003: the fast lane. 3-year ASCI PathForward contract with HP and Intel. MCR & ALC: 2x 1000-node Linux clusters. PNNL: HP IA64, 1000-node Linux cluster. Red Storm, Sandia (8000 nodes, Cray). Lustre Lite 1.0. Many partnerships (HP, Dell, DDN, …).

7 7 - NSC 2003 2003: production, performance. Spring and summer: LLNL MCR went from no, to partial, to full-time use; PNNL similar; stability much improved. Performance, summer 2003: I/O problems tackled, metadata much faster. Dec/Jan: Lustre 1.0.

8 8 - NSC 2003 High level picture

9 9 - NSC 2003 Lustre systems, major components. Clients: have access to the file system; typical role: compute server. OST: object storage targets, which handle (stripes of, references to) file data. MDS: the metadata request transaction engine. Also: LDAP, Kerberos, routers, etc.

10 10 - NSC 2003 [Architecture diagram] Lustre clients (1,000 with Lustre Lite, up to 10,000's later) connect over GigE and QSW networks to the Lustre object storage targets (OST 1-7) and to MDS 1 (active), with MDS 2 as failover. OSTs are Linux OST servers with disk arrays or 3rd-party OST appliances, optionally on a SAN.

11 11 - NSC 2003 [Interaction diagram] Clients consult the LDAP server for configuration information, network connection details, & security management; they go to the OSTs for file I/O & file locking, and to the metadata server (MDS) for directory operations, metadata, & concurrency; MDS and OSTs coordinate recovery, file status, & file creation.

12 12 - NSC 2003 Networking

13 13 - NSC 2003 Lustre networking. Currently runs over: TCP, Quadrics Elan 3 & 4. Lustre can route & can use heterogeneous nets. Beta: Myrinet, SCI. Under development: SAN (FC/iSCSI), I/B. Planned: SCTP, some special NUMA, and other nets.

14 14 - NSC 2003 Lustre network stack: Portals. Bottom to top: device library (Elan, Myrinet, TCP, ...); Portal NALs, the network abstraction layer for TCP, QSW, etc. (small & hard, includes a routing API); Portal library, Sandia's API with a CFS-improved implementation (moves small & large buffers, handles remote DMA, generates events); NIO API (0-copy marshalling libraries, service framework); Lustre request processing (client request dispatch, connection & address naming, generic recovery infrastructure).
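
To make the NAL layer concrete, here is a minimal C sketch of the kind of dispatch table a network abstraction layer could export; the type and function names are hypothetical, not the actual Portals/NAL interface.

    /* Hypothetical sketch of a Portals-style network abstraction
     * layer (NAL): each transport (TCP, Elan, Myrinet, ...) fills in
     * one dispatch table, and the portal library above it stays
     * transport-independent. Names and signatures are illustrative,
     * not the real Lustre/Portals API. */
    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t nal_nid_t;           /* network-wide node id */

    struct nal_ops {
        /* move a small or large buffer to a remote node */
        int  (*send)(nal_nid_t peer, const void *buf, size_t len);
        /* post a buffer for remote DMA (RDMA) into local memory */
        int  (*post_rdma)(void *buf, size_t len);
        /* deliver completion events to the layer above */
        void (*set_event_cb)(void (*cb)(nal_nid_t peer, int status));
        /* forward a message toward a node on another network (routing) */
        int  (*route)(nal_nid_t dest, const void *buf, size_t len);
    };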

15 15 - NSC 2003 Devices and APIs

16 16 - NSC 2003 Lustre devices & APIs. Lustre has numerous driver modules: one API, very different implementations. A driver binds to a named device, and stacking devices is key: they are generalized "object devices". Drivers currently export several APIs: infrastructure (a mandatory API), object storage, metadata handling, locking, and recovery.

17 17 - NSC 2003 Lustre clients & APIs. [Client stack diagram] The Lustre file system (Linux) or Lustre library (Win, Unix, microkernels) sits on a logical object volume (the LOV driver), which fans out to OSC1 … OSCn for data objects & locks, and on an MDC (or a clustered MD driver over several MDCs) for metadata & locks.

18 18 - NSC 2003 Object storage API. Objects are (usually) unnamed files. The API improves on the block device API: create, destroy, setattr, getattr, read, write. The OBD driver does block/extent allocation. Implementation: Linux drivers, using a file system backend. A sketch of the API shape follows below.
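
A minimal C sketch of the API shape the slide names, assuming a dispatch-table style; only the six verbs come from the slide, everything else (names, types) is illustrative.

    /* Sketch of the object storage API as a C dispatch table.
     * Type and member names are hypothetical; only the six verbs
     * (create, destroy, setattr, getattr, read, write) come from
     * the slide. */
    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t objid_t;
    struct obd_attr { uint64_t size; uint32_t mode, uid, gid; };

    struct obd_ops {
        int (*create)(void *dev, objid_t *out_id);
        int (*destroy)(void *dev, objid_t id);
        int (*setattr)(void *dev, objid_t id, const struct obd_attr *a);
        int (*getattr)(void *dev, objid_t id, struct obd_attr *a);
        int (*read)(void *dev, objid_t id, uint64_t off,
                    void *buf, size_t len);
        int (*write)(void *dev, objid_t id, uint64_t off,
                     const void *buf, size_t len);
    };

Because every driver can export the same table, a logical driver such as the LOV can sit on top of n OSCs and present itself as a single device, which is the "stacking devices is key" point from slide 16.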

19 19 - NSC 2003 Bringing it all together. [System diagram] The Lustre client file system combines a metadata WB cache, OSCs, and an MDC, with lock client, recovery, and networking (device (Elan, TCP, …), Portal NALs, Portal library, NIO API, request processing). For system & parallel file I/O and file locking it talks to the OST, an object-based disk server (OBD server) with lock server, recovery, and load balancing over an ext3, Reiser, XFS, … file system backend on Fibre Channel. For recovery, file status, and file creation it talks to the MDS server, with its own lock server, directory metadata & concurrency handling, and the same choice of FS backends on Fibre Channel.

20 20 - NSC 2003 File I/O

21 21 - NSC 2003 File I/O: write operation. Open the file on the metadata server and get information on all objects that are part of the file: object IDs, which storage controllers (OSTs) hold them, what part of the file (offset) each covers, and the striping pattern. Create the LOV and OSC drivers and use the connection to the OSTs: object writes go directly to the OSTs, with no MDS involvement at all. A sketch of this path follows after the diagram below.

22 22 - NSC 2003 [Write-path diagram] The Lustre client (file system, LOV, OSC 1, OSC 2, MDC) sends a file open request to the metadata server (MDS) and receives the file metadata: inode A maps to {(OST 1, obj 1), (OST 3, obj 2)}. The client then issues write (obj 1) to OST 1 and write (obj 2) to OST 3 directly.
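
A hypothetical sketch of that write path in C, assuming RAID-0-style striping and the two helper RPCs declared below; none of these names are Lustre's real interfaces.

    /* One open RPC to the MDS returns the file's layout (which
     * objects, on which OSTs, with what striping); all subsequent
     * writes go from the client straight to the OSTs. */
    #include <stddef.h>
    #include <stdint.h>

    struct stripe { int ost; uint64_t obj; };     /* (OST, object) pair */
    struct layout { int nstripes; uint64_t stripe_sz; struct stripe s[8]; };

    /* assumed helpers: one MDS RPC, one OST write RPC */
    extern int mds_open(const char *path, struct layout *lo);
    extern int ost_write(int ost, uint64_t obj, uint64_t off,
                         const void *buf, size_t len);

    int lustre_write(const char *path, uint64_t off,
                     const void *buf, size_t len)
    {
        struct layout lo;
        if (mds_open(path, &lo) < 0)      /* MDS involved only at open */
            return -1;
        while (len > 0) {
            /* map the file offset onto a stripe: round-robin objects */
            uint64_t blk = off / lo.stripe_sz;
            struct stripe *st = &lo.s[blk % lo.nstripes];
            uint64_t obj_off = (blk / lo.nstripes) * lo.stripe_sz
                             + off % lo.stripe_sz;
            size_t chunk = lo.stripe_sz - off % lo.stripe_sz;
            if (chunk > len) chunk = len;
            if (ost_write(st->ost, st->obj, obj_off, buf, chunk) < 0)
                return -1;                 /* data path: client to OST only */
            off += chunk; buf = (const char *)buf + chunk; len -= chunk;
        }
        return 0;
    }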

23 23 - NSC 2003 I/O bandwidth. 100's of GB/sec implies saturating many 100's of OSTs: at roughly 260 MB/sec per OST, 100 GB/sec already takes about 400 of them. OSTs do ext3 extent allocation and non-caching direct I/O; lock management is spread over the cluster; we achieve 90-95% of network throughput. Single client, single thread on Elan3: 269 MB/sec writes. OSTs handle up to 260 MB/sec even without the extent code, on a 2-way 2.4 GHz Xeon.

24 24 - NSC 2003 Metadata

25 25 - NSC 2003 Intent locks & write-back caching. The client-MDS protocol adapts to concurrency. Low concurrency: write-back caching; the client updates in memory and replays the updates to the MDS later. High concurrency (mostly merged in 2.6): a single network request per transaction and no lock revocations to clients; the intent-based lock request includes the complete request. The diagram and sketch below make this concrete.

26 26 - NSC 2003 [Diagram: mkdir, conventional vs. Lustre] a) Conventional mkdir: the client sends lookup over the network, the file server answers, then the client sends mkdir and the server creates the directory: two round trips. b) Lustre mkdir: the Lustre client sends a single lookup carrying a mkdir intent; the metadata server's lock module exercises the intent (mds_mkdir): one round trip.
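
A minimal sketch of the intent idea in C, under the assumption of a single enqueue RPC that carries the operation; names are illustrative, not Lustre's actual API.

    /* Instead of lookup + mkdir as separate RPCs, the client ships
     * one lock request that carries the mkdir intent, and the MDS
     * executes it while granting the lock. */
    #include <stdint.h>

    enum intent_op { INTENT_LOOKUP, INTENT_MKDIR, INTENT_OPEN };

    struct md_intent {
        enum intent_op op;       /* what the client wants to happen */
        const char    *name;     /* target name within the parent dir */
        uint32_t       mode;     /* e.g. directory permissions */
    };

    /* assumed: one RPC that asks for a lock and embeds the intent */
    extern int mds_enqueue(uint64_t parent_ino, const struct md_intent *it);

    int lustre_mkdir(uint64_t parent_ino, const char *name, uint32_t mode)
    {
        /* one network round trip: the MDS performs the mkdir under
         * its own lock and replies; no lock is revoked from other
         * clients */
        struct md_intent it = { INTENT_MKDIR, name, mode };
        return mds_enqueue(parent_ino, &it);
    }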

27 27 - NSC 2003 Lustre 1.0 only has the high-concurrency model. Aggregate throughput (1,000 clients): ~5000 file creations (open/close) per second; ~7800 stats in 10 x 1M-file directories. Single client: around 1500 creations or stats per second. Handling 10M-file directories is effortless. Many changes to ext3 (all merged in 2.6).

28 28 - NSC 2003 Metadata future: Lustre 2.0, 2004. Metadata clustering: common operations will parallelize. 100% WB caching, in memory or on disk, like AFS.

29 29 - NSC 2003 Metadata odds and ends. Logical drivers: a local persistent metadata cache (like AFS/Coda/InterMezzo), a replicated metadata server driver, a remotely mirrored MDS. Small-scale clusters: CFS focuses on big systems, but with our drivers an ordinary FS can export all the protocols, giving shared ext3/Reiser/... file systems.

30 30 - NSC 2003 Recovery

31 31 - NSC 2003 Recovery approach: keep it simple! Based on failover circles: use existing failover software; your left working neighbor is your failover node. At HP we use failover pairs to simplify storage connectivity. An I/O failure triggers the peer node to serve the failed OST, and retries from clients are routed to the new OST node. A sketch of the neighbor rule follows after the diagram below.

32 32 - NSC 2003 [Diagram: OST server redundant pair] Two OST servers (OST 1, OST 2) reach shared controllers (C1, C2) through an FC switch, so either server can take over its partner's storage.
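
A toy C sketch of the failover-circle rule ("your left working neighbor serves you"); the ring size and health table are illustrative assumptions.

    /* Servers stand in a ring; the left working neighbor of a
     * failed OST takes over its targets. Purely illustrative. */
    #include <stdbool.h>

    #define NSERVERS 8
    static bool alive[NSERVERS];   /* health as seen by the cluster */

    /* who currently serves OST i: itself if alive, else walk left
     * around the circle until a working neighbor is found */
    int serving_node(int i)
    {
        for (int hop = 0; hop < NSERVERS; hop++) {
            int n = (i - hop + NSERVERS) % NSERVERS;
            if (alive[n])
                return n;   /* client retries are routed here */
        }
        return -1;          /* whole circle down */
    }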

33 33 - NSC 2003 Configuration

34 34 - NSC 2003 Lustre 1.0: good tools to build a configuration. The configuration is recorded on the MDS, or on a dedicated management server. The configuration can be changed, but in 1.0 that requires downtime. Clients auto-configure: mount -t lustre -o … mds://fileset/sub/dir /mnt/pt. SNMP support.

35 35 - NSC 2003 Futures

36 36 - NSC 2003 Advanced management. Snapshots: all the features you might expect. Global namespace: combines the best of AFS & autofs4. HSM and hot migration: driven by customer demand (we plan XDSM). Online 0-downtime reconfiguration. All part of Lustre 2.0.

37 37 - NSC 2003 Security

38 38 - NSC 2003 Security. Authentication; POSIX-style authorization; NASD-style OST authorization (refinement: use OST ACLs and cookies); file encryption with a group key service (STK secure file system).

39 39 - NSC 2003 [Security flow diagram] Components: client, Kerberos, MDS, OST, LDAP group server. Step 1: authenticate the user with Kerberos, get a session key. Step 2: authenticated open RPCs to the MDS. Step 3: the MDS traverses ACLs. Step 4: get the OST ACL. Step 5: send the ACL capability & cookie. Step 6: read encrypted file data from the OST. Step 7: get the SFS file key from the group server. Step 8: decrypt the file data on the client.
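
A toy sketch of the NASD-style capability check implied by steps 4-6, assuming the MDS and OST share a MAC key; the structure layout and the mac16 helper are invented for illustration and are not Lustre's wire format.

    /* The MDS signs a capability naming the object and the allowed
     * rights; the OST, sharing a key with the MDS, re-computes the
     * MAC before serving I/O. Illustrative only. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct capability {
        uint64_t obj_id;      /* object this capability covers */
        uint32_t rights;      /* e.g. read/write bits */
        uint64_t expiry;      /* validity window */
        uint8_t  mac[16];     /* MAC computed by the MDS */
    };

    /* assumed helper: keyed MAC over a buffer (e.g. an HMAC) */
    extern void mac16(const uint8_t key[16], const void *buf,
                      size_t len, uint8_t out[16]);

    bool ost_check_capability(const uint8_t shared_key[16],
                              const struct capability *cap,
                              uint64_t obj_id, uint32_t want,
                              uint64_t now)
    {
        uint8_t expect[16];
        /* MAC covers everything except the MAC field itself */
        mac16(shared_key, cap, offsetof(struct capability, mac), expect);
        return memcmp(expect, cap->mac, sizeof expect) == 0
            && cap->obj_id == obj_id
            && (cap->rights & want) == want
            && now < cap->expiry;
    }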

40 40 - NSC 2003 CFS cluster tools for 2.6: remote serial GDB debugging over UDP; conman UDP consoles for syslog and sysrq; core dumps over the net or to local disk, with many dump-format enhancements; dump analysis with a gdb extension (not lcrash); llanalyze, which analyzes distributed Lustre logs.

41 41 - NSC 2003 Metadata transaction protocol. No synchronous I/O unless requested: the server sends a reply immediately and a commit confirmation later. Lustre covers single-component failure, so replay of requests is central: preserve the transaction sequence, acknowledge replies to remove barriers, and avoid cascading aborts (in DB parlance: strict execution). A sketch of the bookkeeping follows below.
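
A minimal sketch of the client-side bookkeeping this implies, assuming a server-assigned transaction number and a last-committed number piggybacked on replies; all names are illustrative.

    /* The client keeps each request until the server confirms it
     * has committed to disk, so after a crash the saved requests
     * can be replayed in transaction order. Illustrative only. */
    #include <stdint.h>
    #include <stdlib.h>

    struct saved_req {
        uint64_t transno;        /* server-assigned transaction number */
        void    *msg;            /* the request, kept for replay */
        struct saved_req *next;  /* next saved request (replay would
                                  * sort by transno) */
    };

    static struct saved_req *replay_list;   /* uncommitted requests */

    /* on reply: remember the request until its transno is committed */
    void got_reply(uint64_t transno, void *msg)
    {
        struct saved_req *r = malloc(sizeof *r);
        r->transno = transno; r->msg = msg;
        r->next = replay_list; replay_list = r;
    }

    /* on commit confirmation: everything <= last_committed is
     * durable and can be dropped; anything newer must be replayed
     * after a crash */
    void got_commit(uint64_t last_committed)
    {
        struct saved_req **p = &replay_list;
        while (*p) {
            if ((*p)->transno <= last_committed) {
                struct saved_req *dead = *p;
                *p = dead->next;
                free(dead);
            } else
                p = &(*p)->next;
        }
    }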

42 42 - NSC 2003 Distributed persistent data updates happen in many places: inode & object creation/removal (MDS/OST), replicating OSTs, metadata clustering. Recovery uses replay logs, with cancellation of log records once the work is durable everywhere. Logs are ubiquitous in Lustre: recovery, WB caching logs, replication, configuration, etc. A sketch of the log pattern follows below.
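
A toy sketch of the log/cancel pattern, assuming one record per distributed change; purely illustrative.

    /* Before making a distributed change (say, MDS inode + OST
     * object), write a log record; once the other node confirms
     * its half is durable, cancel the record. After a crash, the
     * surviving records name exactly the orphaned work to redo or
     * undo. Names are illustrative. */
    #include <stdbool.h>
    #include <stdint.h>

    struct log_rec {
        uint64_t id;        /* record id, for cancellation */
        uint64_t inode;     /* MDS-side half of the change */
        uint64_t object;    /* OST-side half of the change */
        bool     cancelled;
    };

    #define LOG_CAP 1024
    static struct log_rec records[LOG_CAP];
    static uint64_t next_id = 1;

    /* write a record *before* the distributed operation starts */
    uint64_t log_add(uint64_t inode, uint64_t object)
    {
        struct log_rec *r = &records[next_id % LOG_CAP];
        r->id = next_id; r->inode = inode; r->object = object;
        r->cancelled = false;
        return next_id++;
    }

    /* the peer confirmed its half committed: record no longer needed */
    void log_cancel(uint64_t id)
    {
        records[id % LOG_CAP].cancelled = true;
    }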

43 43 - NSC 2003 Project status

44 44 - NSC 2003 Lustre feature roadmap.
Lustre (Lite) 1.0 (Linux 2.4 & 2.6), 2003: failover MDS; basic Unix security; very fast file I/O (~100's of OSTs); intent-based scalable metadata; POSIX compliant.
Lustre 2.0 (2.6), 2004: metadata cluster; advanced security; collaborative read cache; write-back metadata; parallel I/O.
Lustre 3.0, 2005: load-balanced MD; storage management; global namespace.

45 45 - NSC 2003 Cluster File Systems, Inc.

46 46 - NSC 2003 Cluster File Systems: a small service company of 20-30 people. Software development & service (95% Lustre), contract work for government labs: OSS, but defense contracts. Extremely specialized, with extreme expertise: we only do file systems and storage. Investment: not needed; profitable. Partners: HP, Dell, DDN, Cray.

47 47 - NSC 2003 Lustre: conclusions. A great vehicle for advanced storage software, where things are done differently: protocols & design from Coda & InterMezzo, stacking & DB recovery theory applied, and existing components leveraged. Initial signs are promising.

48 48 - NSC 2003 HP & Lustre: two projects. ASCI PathForward (Hendrix), and a Lustre storage product with a field trial in Q1 of '04.

49 49 - NSC 2003 Questions?

