
1 ADIC / CASPUR / CERN / DataDirect / ENEA / IBM / RZ Garching / SGI New results from CASPUR Storage Lab Andrei Maslennikov CASPUR Consortium May 2004

2 Participants:
ADIC Software: E.Eastman
CASPUR: A.Maslennikov (*), M.Mililotti, G.Palumbo
CERN: C.Curran, J.Garcia Reyero, M.Gug, A.Horvath, J.Iven, P.Kelemen, G.Lee, I.Makhlyueva, B.Panzer-Steindel, R.Többicke, L.Vidak
DataDirect Networks: L.Thiers
ENEA: G.Bracco, S.Pecoraro
IBM: F.Conti, S.De Santis, S.Fini
RZ Garching: H.Reuter
SGI: L.Bagnaschi, P.Barbieri, A.Mattioli
(*) Project Coordinator

3 Sponsors for these test sessions:
ACAL Storage Networking: loaned a 16-port Brocade switch
ADIC Software: provided the StorNext file system product, actively participated in tests
DataDirect Networks: loaned an S2A 8000 disk system, actively participated in tests
E4 Computer Engineering: loaned 10 assembled biprocessor nodes
Emulex Corporation: loaned 16 Fibre Channel HBAs
IBM: loaned a FAStT900 disk system and the SANFS product complete with 2 MDS units, actively participated in tests
Infortrend-Europe: sold 4 EonStor disk systems at a discount price
INTEL: donated 10 motherboards and 20 CPUs
SGI: loaned the CXFS product
Storcase: loaned an InfoStation disk system

4 Contents
- Goals
- Components under test
- Measurements:
  - SATA/FC systems
  - SAN File Systems
  - AFS Speedup
  - Lustre (preliminary)
  - LTO2
- Final remarks

5 Goals for these test series
1. Performance of low-cost SATA/FC disk systems
2. Performance of SAN File Systems
3. AFS speedup options
4. Lustre
5. Performance of the LTO-2 tape drive

6 Components
Disk systems:
4x Infortrend EonStor A16F-G1A2 16-bay SATA-to-FC arrays:
  Maxtor Maxline Plus II 250 GB SATA disks (7200 rpm)
  Dual Fibre Channel outlet at 2 Gbit
  Cache: 1 GB
2x IBM FAStT900 dual-controller arrays with SATA expansion units:
  4x EXP100 expansion units with 14 Maxtor SATA disks of the same type
  Dual Fibre Channel outlet at 2 Gbit
  Cache: 1 GB
1x StorCase InfoStation 12-bay array:
  same Maxtor SATA disks
  Dual Fibre Channel outlet at 2 Gbit
  Cache: 256 MB
1x DataDirect S2A 8000 system:
  2 controllers with 74 FC disks of 146 GB
  8 Fibre Channel outlets at 2 Gbit
  Cache: 2.56 GB

7 Infortrend EonStor A16F-G1A2
- Two 2 Gbps Fibre host channels
- RAID levels supported: RAID 0, 1 (0+1), 3, 5, 10, 30, 50, NRAID and JBOD
- Multiple arrays configurable with dedicated or global hot spares
- Automatic background rebuild
- Configurable stripe size and write policy per array
- Up to 1024 LUNs supported
- 3.5", 1"-high 1.5 Gbps SATA disk drives
- Variable stripe size per logical drive
- Up to 64 TB per logical drive
- Up to 1 GB SDRAM cache

8 IBM FAStT900 Storage Server
- 2 Gbps SFP host connections
- Expansion units: EXP700 (FC) / EXP100 (SATA)
- Four SAN (FW-SW) or eight direct (FC-AL) host attachments
- Four (redundant) 2 Gbps drive channels
- Capacity: min 250 GB – max 56 TB (14 disks x EXP100 SATA); min 32 GB – max 32 TB (14 disks x EXP700 FC)
- Dual-active controllers
- Cache: 2 GB
- RAID support: 0, 1, 3, 5, 10

9 StorCase InfoStation Fibre-to-SATA
- SATA and Ultra ATA/133 drive interface
- 12 hot-swappable drives
- Switched or FC-AL host connections
- RAID levels: 0, 1, 0+1, 3, 5, 30, 50 and JBOD
- Dual 2 Gbps Fibre host ports
- Supports up to 8 arrays and 128 LUNs
- Up to 1 GB PC200 DDR cache memory

10 DataDirect S2A 8000
- Single 2U S2A8000 with four 2 Gb/s ports, or dual 4U with eight 2 Gb/s ports
- Up to 1120 disk drives; 8192 LUNs supported
- 5 TB to 130 TB with FC disks, 20 TB to 250 TB with SATA disks
- Sustained performance well over 1 GB/s (1.6 GB/s theoretical)
- Full Fibre Channel duplex performance on every port
- PowerLUN: 1 GB/s+ individual LUNs without host-based striping
- Up to 20 GB of cache, LUN-in-cache solid state disk functionality
- Real-time any-to-any virtualization
- Very fast rebuild rate

11 Components
- High-end Linux units for both servers and clients:
  biprocessor Pentium IV Xeon 2.4+ GHz, 1 GB RAM
  Qlogic QLA2300 2 Gbit or Emulex LP9xxx Fibre Channel HBAs
- Network: 2x Dell 5224 GigE switches
- SAN: Brocade 3800 switch, 16 ports (test series 1); Qlogic SANbox 5200, 32 ports (test series 2)
- Tapes: 2x IBM Ultrium LTO2 (3580-TD2, Rev: 36U3)

12 Qlogic SANbox 5200 Stackable Switch
- 8, 12 or 16 auto-detecting 2 Gb/1 Gb device ports with 4-port incremental upgrade
- Stacking of up to 4 units for 64 available user ports
- Interoperable with all FC-SW-2 compliant Fibre Channel switches
- Full-fabric, public-loop or switch-to-switch connectivity on 2 Gb or 1 Gb front ports
- "No-wait" routing: guaranteed maximum performance independent of data traffic
- Supports traffic between switches, servers and storage at up to 10 Gb/s
- Low cost: the 5200/16p costs less than half as much as the Brocade 3800/16p
- May be upgraded in 8-port steps

13 IBM LTO Ultrium 2 Tape Drive Features
- 200 GB native capacity (400 GB compressed)
- 35 MB/s native transfer rate (70 MB/s compressed)
- Native 2 Gb FC interface
- Backward compatible: reads and writes Ultrium 1 cartridges
- 64 MB buffer (vs 32 MB buffer in Ultrium 1)
- 512 tracks vs. 384 tracks in Ultrium 1
- Speed matching, channel calibration
- Faster load/unload time, data access time, rewind time

14 SATA / FC Systems

15 SATA / FC Systems – hardware details
Typical array features:
- single or dual (active-active) controller
- up to 1 GB of RAID cache
- battery backup to preserve the cache during power cuts
- 8 to 16 drive slots
- cost: 4-6 KUSD per 12/16-bay unit (Infortrend, Storcase)
Case and backplane directly affect disk lifetime:
- protection against inrush currents
- protection against rotational vibration
- orientation (horizontal better than vertical – remark by A.Sansum)
Infortrend EonStor: well engineered (removable controller module, lower vibration, horizontal orientation)
Storcase: special protection against inrush currents (soft-start drive power circuitry), low vibration

16 SATA / FC Systems – hardware details
High-capacity ATA/SATA disk drives:
- 250 GB (Maxtor, IBM), 400 GB (Hitachi)
- RPM: 7200
- improved quality: warranty 3 years, component design lifetime 5 years
CASPUR experience with Maxtor drives:
- In 1.5 years we lost 5 drives out of ~100, 2 of which due to power cuts
- Factory quality of the recent Maxtor Maxline Plus II 250 GB disks: out of 66 disks purchased, 4 had to be replaced shortly after delivery; the others stand the stress very well
Learned during this meeting:
- RAL annual failure rate is 21 out of 920 Maxtor Maxline drives

17 SATA / FC Systems – test setup
Parameters to select / tune:
- stripe size for RAID-5
- SCSI queue depth on the controller and on the Qlogic HBAs
- number of disks per logical drive
In the end we were working with RAID-5 LUNs composed of 8 HDs each.
Stripe size: 128K (and 256K in some tests)
[Setup diagram: 4x IFT A16F-G1A2, 4x IBM FAStT900 and the StorCase InfoStation behind a Qlogic switch; dual-Xeon 2.4+ GHz nodes with Qlogic 2310F HBAs; Dell 5224 GigE switch]

18 SATA / FC tests – kernel and fs details
Kernel settings:
- Kernels: smp, XFS1.3.1smp
- vm.bdflush:
- vm.max(min)-readahead: 256(127) for large streaming writes, 4(3) for random reads with small block size
File systems:
- EXT3 (128k RAID-5 stripe size):
  fs options: -m 0 -j -J size=128 -R stride=32 -T largefile4
  mount options: data=writeback
- XFS (128k RAID-5 stripe size):
  fs options: -i size=512 -d agsize=4g,su=128k,sw=7,unwritten=0 -l su=128k
  mount options: logbsize=262144,logbufs=8
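For concreteness, the options listed above correspond to commands roughly like the following. This is a minimal sketch, not the actual scripts used in the tests; the device name, mount point and the 2.4-era sysctl names are assumptions.

    #!/bin/bash
    DEV=/dev/sdb1      # hypothetical LUN device
    MNT=/fs            # hypothetical mount point

    # EXT3 variant: RAID-5 with 128k stripe -> stride = 128k / 4k blocks = 32
    mke2fs -m 0 -j -J size=128 -R stride=32 -T largefile4 $DEV
    mount -t ext3 -o data=writeback $DEV $MNT

    # XFS variant (alternative to the above, same 8-disk RAID-5 geometry:
    # 128k stripe unit, 7 data disks)
    # mkfs.xfs -i size=512 -d agsize=4g,su=128k,sw=7,unwritten=0 -l su=128k $DEV
    # mount -t xfs -o logbsize=262144,logbufs=8 $DEV $MNT

    # 2.4-kernel VM readahead tuning (values in pages): large for streaming
    # writes; use 4/3 instead when running the random-read tests.
    sysctl -w vm.max-readahead=256
    sysctl -w vm.min-readahead=127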

19 SATA / FC tests – benchmarks used
Large serial writes and reads:
- lmdd from the lmbench suite; typical invocation: lmdd of=/fs/file bs=1000k count=8000 fsync=1
Random reads:
- Pileup, a benchmark designed to emulate the disk activity of multiple data-analysis jobs:
  1) a series of 2 GB files is created in the destination directory
  2) these files are then read back randomly, in many threads
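Pileup itself was an in-house tool; the sketch below only illustrates the access pattern it emulates. The file count, thread count, chunk size and paths are illustrative assumptions, not parameters taken from the slides.

    #!/bin/bash
    # Phase 1: create a series of ~2 GB files in the destination directory.
    DIR=/fs/pileup
    NFILES=16
    mkdir -p $DIR
    for i in $(seq 1 $NFILES); do
        lmdd of=$DIR/file$i bs=1000k count=2000 fsync=1
    done

    # Phase 2: read the files back in many concurrent "analysis" threads,
    # each thread fetching random 1 MB chunks from randomly chosen files.
    NTHREADS=32
    reader() {
        for n in $(seq 1 500); do
            f=$DIR/file$(( (RANDOM % NFILES) + 1 ))
            dd if=$f of=/dev/null bs=1M count=1 skip=$((RANDOM % 1900)) 2>/dev/null
        done
    }
    for t in $(seq 1 $NTHREADS); do reader & done
    wait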

20 SATA / FC results
EXT3 results – filling 1.7 TB with 8 GB files
IFT systems show anomalous behaviour with the EXT3 file system: performance varies along the file system. The effect visibly depends on the RAID-5 stripe size.
[Plot: write speed along the file system for 32K, 128K and 256K stripe sizes]
The problem was reproduced and understood by Infortrend; new firmware is due in July.
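A sketch of how such a fill test can be scripted is shown below: write 8 GB files back-to-back until the 1.7 TB file system is nearly full, logging the lmdd throughput of each file so that speed can be plotted against position in the file system. This is our reading of the method, not the actual test harness; the mount point and log path are assumptions.

    #!/bin/bash
    # Fill a freshly created file system with 8 GB files, logging the
    # per-file lmdd rate so it can be plotted against position.
    MNT=/fs
    i=0
    # keep going while more than ~8 GB (in 1K blocks) remain free
    while [ $(df -P $MNT | awk 'NR==2 {print $4}') -gt 8000000 ]; do
        i=$((i + 1))
        echo -n "file$i: " >> /tmp/fill.log
        lmdd of=$MNT/file$i bs=1000k count=8000 fsync=1 >> /tmp/fill.log
    done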

21 SATA / FC results
IBM FAStT and Storcase behave in a more predictable manner with EXT3. Both systems may however lose up to 20% in performance along the file system.

22 SATA / FC results
XFS results – filling 1.7 TB with 8 GB files
The situation changes radically with this file system. The curves become almost flat, and everything is much faster than with EXT3.
[Plot: write/read speed along the file system for IBM, StorCase and Infortrend]
Infortrend and Storcase show comparable write speeds; IBM is much slower on writes (below 100 MB/sec). Read speeds are visibly higher thanks to the read-ahead function of the controller (the IBM and IFT systems had 1 GB of RAID cache, the StorCase only 256 MB).

23 SATA / FC results
Pileup tests: these were done only on the IFT and Storcase systems. The results depend to a large extent on the number of threads accessing the previously prepared files (beyond a certain number of threads performance may drop, since the test machines may struggle to handle many threads at a time). The best result was obtained with the Infortrend array and the XFS file system.
[Table: Pileup throughput in MB/sec vs. number of threads, for EXT3 and XFS on StorCase and Infortrend]

24 SATA / FC results
Operation in degraded mode: we tried it on a single Infortrend LUN of 5 HDs with EXT3. One of the disks was removed and the rebuild process was started. The write speed went down from 105 to 91 MB/sec; the read speed went down from 105 to 28 MB/sec and even less.

25 SATA / FC results – conclusions
1) The recent low-cost SATA-to-FC disk arrays (Infortrend, Storcase) operate very well and deliver excellent I/O speeds, far exceeding that of Gigabit Ethernet. The cost of such systems may be as low as 2.5 USD per raw GB. The quality of these systems is dominated by the quality of the SATA disks.
2) The choice of local file system is fundamental: XFS easily outperforms EXT3. On one occasion we observed an XFS hang under very heavy load; xfs_repair was run and the error never reappeared. We plan to investigate this in depth. CASPUR AFS and NFS servers are all XFS-based, and there has been only one XFS-related problem since we put XFS into production 1.5 years ago. But perhaps we were simply lucky.

26 SAN File Systems

27 SAN File Systems
SAN FS placement: these advanced distributed file systems allow clients to operate directly with block devices (block-level file access). Metadata traffic goes via GigE. A Storage Area Network is required.
The current cost of a single Fibre Channel connection is above 1000 USD: switch port, min ~500 USD including GBIC; host bus adapter, min ~800 USD. Special discounts for massive purchases are not impossible, but it is very hard to imagine that the cost of a connection will become significantly lower in the near future.
A SAN FS with native Fibre Channel connection is thus still not an option for large farms. A SAN FS with iSCSI connection may be re-evaluated in combination with the new iSCSI-SATA disk arrays.

28 SAN File Systems
Where SAN file systems with FC connection may be used:
1) High-performance computing – fast parallel I/O, faster sequential I/O
2) Hybrid SAN / NAS systems: a relatively small number of SAN clients acting as (also redundant) NAS servers
3) HA clusters with file locking: mail (shared pool), web, etc.

29 SAN File Systems
So far we have tried these products:
0) Sistina GFS (see our 2002 and 2003 reports)
1) ADIC StorNext File System
2) IBM SANFS (StorTank) (preliminary; we continue looking into it)
3) SGI CXFS (work in progress)

30 SAN File Systems
FS        Platforms                                                  MDS host required   Max FS size
GFS       Server-Client: Linux32/64                                  No                  2 TB
StorNext  Server-Client: AIX, Linux, Solaris, IRIX, Windows          No                  petabytes
StorTank  Server: Linux32; Client: AIX, Linux, Windows, Solaris      Yes                 petabytes
CXFS      Server: IRIX/Linux64; Client: IRIX, Solaris, AIX,          Yes                 exabytes (Linux32: 2 TB)
          Windows, Linux, OS X

31 SAN File Systems
What was measured (StorNext and StorTank):
1) Aggregate write and read speeds on 1, 7 and 14 clients
2) Aggregate Pileup speed on 1, 7 and 14 clients accessing: A) different sets of files, B) the same set of files
During these tests we used 4 LUNs of 13 HDs each, as recommended by IBM. For each SAN FS we tried both the IFT and the FAStT disk systems (see the sketch of a multi-client run below).
[Setup diagram: 4x IFT A16F-G1A2 and 4x IBM FAStT900 behind a Qlogic switch; dual-Xeon 2.4+ GHz client nodes with Qlogic 2310F HBAs; Dell 5224 GigE; IA32 IBM StorTank MDS; Origin 200 CXFS MDS]
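The aggregate figures on the following slides are sums of per-client rates. A rough sketch of how such a run can be driven from one node is shown below; the host names, file system path and client count are placeholders, not the actual test configuration.

    #!/bin/bash
    # Start lmdd simultaneously on N clients over ssh and collect the rates;
    # the aggregate throughput is the sum of the per-client figures.
    CLIENTS="node01 node02 node03 node04 node05 node06 node07"   # 7-client case
    for c in $CLIENTS; do
        ssh $c "lmdd of=/sanfs/test.$c bs=1000k count=8000 fsync=1" > /tmp/lmdd.$c &
    done
    wait
    cat /tmp/lmdd.*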

32 SAN File Systems
Large sequential files: StorNext and StorTank behave in a similar manner on writes; StorNext does better on reads. The IBM disk systems perform better than IFT on reads with multiple clients.
[Two tables, all numbers in MB/sec: IBM StorTank and ADIC StorNext – aggregate write and read speeds for 1, 7 and 14 clients, on IBM and IFT disk systems]

33 SAN File Systems
Pileup tests: StorTank definitely outperforms StorNext in this type of benchmark. The results are very interesting, as it turns out that peak Pileup speeds with StorTank on a single client may reach the GigE speed (case of the IFT disk).
[Two tables, all numbers in MB/sec: IBM StorTank and ADIC StorNext – aggregate Pileup speed vs. number of threads for 1, 7 and 14 clients, on IBM and IFT disks, for access patterns A (different files) and B (same files); StorNext was unstable for IFT with more than 1 client]

34 SAN File Systems
CXFS experience: MDS on an SGI Origin 200 with 1 GB of RAM (IRIX), 4 IFT arrays.
The first numbers were not bad, but with 4 or more clients the system becomes unstable (when they are all used at a time, one client will hang). This is what we have observed so far:
N of clients   Seq. write   Seq. read
1              62 MB/s      130 MB/s
2              91 MB/s      245 MB/s
3              117 MB/s     306 MB/s
We are currently investigating the problem together with SGI.

35 SAN File Systems
StorNext on the DataDirect system
- The S2A 8000 came with FC disks, although we had asked for SATA
- Quite easy to configure, extremely flexible
- Multiple levels of redundancy, small declared performance degradation on rebuilds
- We ran only large serial write and read 8 GB lmdd tests using all the available power
[Table, MB/sec: R/W rates for EXT2 on 8 distinct LUNs vs. StorNext on 2 PowerLUNs; setup: 2x S2A FC outlets, 2x Brocade switches, 2.4+ GHz nodes with Emulex LP9xxx HBAs, Dell GigE]

36 SAN File Systems – some remarks
- The performance of a SAN file system is quite close to that of the disk hardware it is built upon (in the case of a native FC connection).
- StorNext is the easiest to configure. It does not require a standalone MDS and works smoothly with all kinds of disk systems, FC switches, etc. We were able to export it via NFS, but with the loss of 50% of the available bandwidth. iSCSI=?
- StorTank is probably the most solid implementation of a SAN FS, and it has a lot of useful options. It delivers the best numbers for random reads and may be considered a good candidate for relatively small clusters with native FC connection intended for express data analysis. It may have issues with 3rd-party disks. Supports iSCSI.
- CXFS uses the very performant XFS base and hence should have good potential, although the 2 TB file system size limit on Linux/32-bit is a real limitation (the same is true for GFS). Some functions like MDS fencing require particular hardware. iSCSI=?
- MDS load: small for StorNext and CXFS, quite high for StorTank.

37 AFS Speedup

38 AFS speedup options
- AFS performance for large files is quite poor, even on very performant hardware. To a large extent this is due to the limitations of the Rx RPC protocol and to a suboptimal implementation of the file server.
- One possible workaround is to replace the Rx protocol with an alternative one in all cases where it is used for file serving. We evaluated two such experimental implementations:
1) AFS with OSD support (Rainer Toebbicke). Rainer stores AFS data inside Object-based Storage Devices (OSDs), which need not reside inside the AFS file servers. The OSD performs basic space management and access control and is implemented as a Linux daemon in user space on an EXT2 file system. The AFS file server acts only as an MDS.
2) Reuter's fast AFS (Hartmut Reuter). In this approach, AFS partitions (/vicepXX) are made visible on the clients via a fast SAN or NAS mechanism. As in case 1), the AFS file server acts as an MDS and directs the clients to the right files inside /vicepXX for faster data access.

39 AFS speedup options
Both methods worked!
The AFS/OSD scheme was tested during the Fall 2003 test session; the tests were done with the DataDirect S2A 8000 system. In one particular test we were able to achieve a 425 MB/sec write speed for both the native EXT2 and the AFS/OSD configurations.
Reuter's AFS was evaluated during the Spring 2004 session. The StorNext SAN file system was used to distribute a /vicepX partition among several clients. As in the previous case, AFS/Reuter performance was practically equal to the native performance of StorNext for large files.
To learn more about the DataDirect system and the Fall 2003 session, please visit the following site:

40 Lustre!

41 Lustre – preliminary results
- Lustre: we used 4 Object Storage Targets on 4 Infortrend arrays, no striping
- Very interesting numbers for sequential I/O (8 GB files)
[Table: Lustre sequential write and read speeds in MB/sec vs. number of clients]
These numbers may be directly compared with the SAN FS results obtained with the same disk arrays:
[Table: StorTank and StorNext sequential write and read speeds in MB/sec vs. number of clients]

42 LTO-2 Tape Drive

43 LTO-2 tape drive
The drive is a factor-2 evolution of its predecessor, LTO-1. According to the specs it should deliver up to 35 MB/sec native I/O speed and 200 GB of native capacity.
We were mainly interested in checking the following (see the next page and the measurement sketch below):
- write speed as a function of block size
- time to write a tape mark
- positioning times
Overall judgement: quite positive. The drive fits well for backup applications and is acceptable for staging systems. Its strong point is definitely the relatively low cost (10-11 KUSD), which makes it quite competitive (compare with ~30 KUSD for the STK 9940B).
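These three quantities can be measured with standard dd and mt commands; the sketch below illustrates the idea. The tape device name, the block sizes and the transfer size are assumptions, and with compression enabled an incompressible input file should be used instead of /dev/zero to obtain the native rate.

    #!/bin/bash
    TAPE=/dev/nst0      # non-rewinding tape device (assumed)

    # Write speed as a function of block size: time a ~2 GB write per block size.
    for bs in 32k 64k 128k 256k 512k 1024k; do
        mt -f $TAPE rewind
        echo "blocksize $bs:"
        time dd if=/dev/zero of=$TAPE bs=$bs count=$((2000000 / ${bs%k}))
    done

    # Time to write a tape mark.
    time mt -f $TAPE weof 1

    # Positioning: time an fsf to a file further down the tape
    # (assumes several files have already been written).
    mt -f $TAPE rewind
    time mt -f $TAPE fsf 10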

44 LTO-2 tape drive
Write speed as a function of block size: > 31 MB/sec native for large blocks, very stable.
[Plot: LTO-2 write speed vs. block size]
Tape mark writing is rather slow (on the order of seconds per tape mark).
Positioning: it may take up to 1.5 minutes to fsf to the needed file (average: 1 minute).

45 Final remarks
Our immediate plans include:
- Further investigation of StorTank, CXFS and yet another SAN file system (Veritas), including NFS export
- Evaluation of iSCSI-enabled SATA RAID arrays in combination with SAN file systems
- Further Lustre testing on IFT and IBM hardware (new version 1.2, striping, other benchmarks)
Feel free to join us at any moment!
