
1 Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft
File Systems for your Cluster: selecting a storage solution for tier 2. Suggestions and experiences
Jos van Wezel, Institute for Scientific Computing, Karlsruhe, Germany
jvw@iwr.fzk.de

2 Overview
– Estimated sizes and needs
– GridKa today and roadmap
– Connection models
– Hardware choices
– Software choices
– LCG

3 Scaling the tiers
– Tier 0: 2 PB disk, 10 PB tape, 6000 kSI (data collection, distribution to tier 1)
– Tier 1: 1 PB disk, 10 PB tape, 2000 kSI (data processing, calibration, archiving for tier 2, distribution to tier 2)
– Tier 2: 0.2 PB disk, no tape, 3000 kSI (data selections, simulation, distribution to tier 3)
– Tier 3: location and/or group specific
1 Opteron today ~ 1 kSI
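To make the CPU figures concrete, here is a minimal sketch that converts the kSI budgets above into node counts using the slide's "1 Opteron today ~ 1 kSI" rule of thumb. The numbers are illustrative only; real SI2k ratings per node vary.

```python
# Back-of-envelope conversion of the tier CPU budgets into node counts.
# Assumes the slide's "1 Opteron today ~ 1 kSI" rule of thumb.

tiers_ksi = {"Tier 0": 6000, "Tier 1": 2000, "Tier 2": 3000}
KSI_PER_NODE = 1.0  # per-node rating assumed from the slide

for name, ksi in tiers_ksi.items():
    print(f"{name}: ~{ksi / KSI_PER_NODE:.0f} nodes")
```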

4 GridKa growth

5 Storage at GridKa
– GPFS via NFS to nodes
– dCache via dcap to nodes

6 GridKa road map
2004-2005
– expand and stabilize the GPFS / NFS combination
– possibly install Lustre
– integrate dCache
– look for an alternative to TSM, but only if really needed
– try SATA disks
2004-2007
– decide the path for the parallel FS and dCache
– decide on the tape backend
– scale for LHC (200-300 MB/s continuous for some weeks)

7 Tier 2 targets (source: G. Quast / Uni-KA)
– 5 MB/s throughput per node
– 300 nodes
– 1000 MB/s aggregate
– 200 TB overall disk storage
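As a hedged cross-check of these targets: the sketch below multiplies out the per-node and aggregate figures (which do not match exactly on the slide) and derives the disk share per node and the time for one full pass over the store. All input numbers come from the bullet list above.

```python
# Rough sanity check of the tier 2 targets (illustrative, not from the slides).

nodes = 300
per_node_mb_s = 5          # MB/s per node
aggregate_mb_s = 1000      # MB/s target from the slide
disk_tb = 200              # overall disk storage

print(f"nodes * per-node rate = {nodes * per_node_mb_s} MB/s "
      f"(slide target: {aggregate_mb_s} MB/s)")
print(f"disk per node        = {disk_tb / nodes * 1000:.0f} GB")

# time to stream through the full disk store at the aggregate target rate
seconds = disk_tb * 1e6 / aggregate_mb_s
print(f"full pass over {disk_tb} TB at {aggregate_mb_s} MB/s ~ {seconds / 86400:.1f} days")
```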

8 Estimate your needs (1)
Can you charge for the storage?
– influences the choice between on-line and off-line (tape) storage
– classification of data (volatile, precious, high IO, low IO)
How many nodes will access the storage simultaneously?
– absolute number of nodes
– number of nodes that run a particular job
– job classification to separate accesses

9 Estimate your needs (2)
What kind of access (read/write/transfer sizes)?
– ability to control the access pattern: pre-staging, software tuning
– job classification to influence the access pattern, spread via the scheduler
What size will the storage eventually have?
– exploit random access across a large number of controllers
– roughly 4 TB or 100 MB/s per controller
– high-speed disks needed (see the sizing sketch below)
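The per-controller rule of thumb above translates directly into a controller count. This is a minimal sketch, assuming roughly 4 TB or 100 MB/s per controller as stated on the slide, with the tier 2 targets from slide 7 as the example input.

```python
# Minimal controller-count estimate from the per-controller rule of thumb.
import math

def controllers_needed(total_tb, total_mb_s, tb_per_ctrl=4.0, mb_s_per_ctrl=100.0):
    by_capacity = math.ceil(total_tb / tb_per_ctrl)
    by_throughput = math.ceil(total_mb_s / mb_s_per_ctrl)
    # the larger of the two requirements decides
    return max(by_capacity, by_throughput)

# Example: the tier 2 targets from slide 7
print(controllers_needed(total_tb=200, total_mb_s=1000))  # -> 50 (capacity-bound)
```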

10 Disk technology keys
– Disk areal density is higher than tape (disks are rigid)
– Density growth rate for disks continues, but more slowly (deviation from Moore's law, as for CPUs)
– The superparamagnetic effect is not yet slowing progress (the end has been "in view" for 20 years)
– Convergence of disk and tape costs has stopped (still a factor of 4 to 5 difference)
– Disks and tape will both be around for at least another 10 years

11 Disk areal density vs. head-media spacing (chart)
– Hitachi Deskstar 7K400 (2004): 400 GB, 61 Gb/in.²
– IBM RAMAC (1956): 5 MB, ~2 kb/in.²
Axes: areal density (Mb/in.²) vs. head-to-media spacing (nm)

12 To SATA or not (compared to SCSI/FC)
– Up to 4 times cheaper (3 k/TB vs. 10 k/TB)
– About 2 times slower in a multi-user environment (access time)
– Not really suited for 24/7 operation (more failures)
– Larger capacity per disk: max 140 GB SCSI vs. 400 GB SATA (today)
– No large-scale experience yet
– Drive warranty of only 1 or 2 years
GridKa uses SCSI, SAN and expensive controllers
– bad experiences with IDE NAS boxes (160 GB disks, 3ware controllers)
– new products combine SATA disks with expensive controllers
– IO operations matter more than throughput for most accesses
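For scale, a rough cost comparison at the 200 TB tier 2 target using the per-TB prices quoted above; the currency is assumed to be EUR (the later slides quote "keuro") and the figures cover raw capacity only.

```python
# Quick cost comparison for the 200 TB tier 2 target, using the rough
# per-TB prices quoted on the slide (assumed EUR; illustration only).

capacity_tb = 200
price_per_tb = {"SATA": 3_000, "SCSI/FC": 10_000}

for tech, price in price_per_tb.items():
    print(f"{tech:8s}: {capacity_tb * price / 1e6:.1f} M for {capacity_tb} TB raw")
```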

13 Network attached storage (diagram)
– IO path via the network
– IO path locally: Fibre Channel or SCSI

14 NAS example
– server with 4 dual SCSI busses: more than 1 GB/s transfer
– 4 x 2 SATA RAID boxes (16 * 250 GB): ~4 TB per bus
– 2 * 4 * 2 * 4 TB = 64 TB on one server
– est. 30 kEUR, or 35 kEUR with point-to-point FC
Not that bad.
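Re-running the slide's capacity product as written (2 channels x 4 adapters x 2 boxes per bus x 4 TB per box) gives 64 TB; the grouping of the factors is taken from the slide and the sketch is only a sanity check of the arithmetic.

```python
# Recomputing the NAS example capacity from the factors on the slide.
channels_per_adapter = 2   # dual SCSI busses
adapters = 4
boxes_per_bus = 2          # "4 x 2" SATA RAID boxes
tb_per_box = 16 * 0.25     # 16 disks of 250 GB = 4 TB per box

total_tb = channels_per_adapter * adapters * boxes_per_bus * tb_per_box
print(total_tb)            # -> 64.0
```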

15 SAN (diagram)
– IO path to each host via SAN or iSCSI

16 SAN or Ethernet
SAN offers easier management
– exchange hardware without interruption
– join separate storage elements
– iSCSI needs a separate network (SCSI over IP)
Very scalable performance
– via switches or directors
One SCSI bus maxes out at 320 MB/s
– better than current FC, but FC is duplex
– SCSI is not a fabric
– example follows
ELVM for easier management
Network block device: kernel 2.6 raises the limit to 16 TB
SAN is expensive (500 EUR per HBA, 1000 EUR per switch port)
A direct-connection limitation can be partly compensated with a high-speed interconnect (InfiniBand, Myrinet, etc.)
– tightly coupled cluster with InfiniBand; can be used for FC too, depending on the FS software
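A back-of-envelope look at the "320 MB/s vs. FC" bullet, as a sketch assuming 2 Gbit/s Fibre Channel (current at the time) with 8b/10b line coding; the SCSI figure is the shared, half-duplex Ultra320 bus rate from the slide.

```python
# Bandwidth comparison: shared Ultra320 SCSI bus vs. a duplex 2 Gb FC link.

scsi_bus_mb_s = 320                           # Ultra320, half duplex, shared bus
fc_gbit = 2                                   # assumed 2 Gbit/s FC
fc_mb_s_per_direction = fc_gbit * 1000 / 10   # 8b/10b: roughly 100 MB/s per Gbit
fc_duplex_mb_s = 2 * fc_mb_s_per_direction

print(f"SCSI bus:     {scsi_bus_mb_s} MB/s shared")
print(f"2 Gb FC link: {fc_mb_s_per_direction:.0f} MB/s each way, "
      f"{fc_duplex_mb_s:.0f} MB/s duplex")
```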

17 Combining FC and InfiniBand

18 Software to drive the hardware
File systems
– GPFS (IBM): GridKa uses this, as does Uni-KA
– SAN-FS (IBM, $$): supports a range of architectures
– Lustre (HP, $): Uni-KA Rechenzentrum cluster
– PVFS: stability is rather low
– GFS (now Red Hat) or OpenGFS
– NFS: the Linux implementation is messy, but Red Hat EL 3.0 seems promising
– NAS boxes reach impressive throughput, are stable, easy to manage and grow as needed (NetApp, Exanet)
– Terragrid (very new)
(Almost-POSIX) access via library preload
– write once / read many: changing a file means creating a new one and deleting the old
– not usable for all software (e.g. no DBMS!)
– examples: GridFTP (gfal), (x)rootd (rfio), dCache (dcap/gfal/rfio)
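To illustrate the "write once / read many" constraint described above: a file behind these preload libraries cannot be updated in place, so an "update" is really write-new-then-retire-old. The sketch below models that semantics with plain local files standing in for dcap/rfio access; the file name and helper are made up for the example.

```python
# Illustration of write-once / read-many update semantics (local-file stand-in).
import os

def update_worm_file(path, transform):
    """Replace a write-once file by writing a full new copy and retiring the old one."""
    tmp = path + ".new"
    with open(path, "rb") as src, open(tmp, "wb") as dst:
        dst.write(transform(src.read()))      # write the complete new version
    os.replace(tmp, path)                     # swap in the new file, drop the old

# usage (hypothetical path): append a record by rewriting the whole file
# update_worm_file("/data/run0042.dat", lambda b: b + b"new-event-record\n")
```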

19 GPFS
– Stripes over n disks
– Linux and AIX, or combined
– Max FS size 70 TB
– HSM option
– Scalable and very robust
– Easy management
– SAN, IP+SAN, or IP only
– Add and remove storage on-line
– Vendor lock-in

20 Accumulated throughput as a function of the number of nodes/RAID arrays (GPFS): chart of read and write rates in MB/s

21 SAN FS
– Metadata server failover
– Policy-based management
– Add and remove storage on-line
– $$$

22 LUSTRE
– Object based
– LDAP configuration database
– Failover of OSTs
– Support for heterogeneous networks, e.g. InfiniBand
– Advanced security
– Open source

23 SRM: Storage Resource Manager
Glue between the worldwide grid and local mass storage (the SE)
A storage element should offer:
– GridFTP
– an SRM interface
– information publication via MDS
LCG has SRM2 almost ready; SRM1 is in operation
SRM is built on top of known MSS (CASTOR, dCache, Jasmine)
dCache implements SRM v1

24 User SRM interaction (diagram)
Legend:
– LFN: logical file name
– RMC: replication metadata catalog
– GUID: grid unique identifier
– RLC: replica location catalog
– RLI: replica location index
– RLC + RLI = RLS
– RLS: replica location service
– SURL: site URL
– TURL: transfer URL
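A hypothetical sketch of the name-resolution chain implied by the legend above (LFN -> GUID -> SURL -> TURL). The dictionaries stand in for the RMC, RLS and SRM services; none of this is a real client API, and all names and URLs are invented for the example.

```python
# Toy model of grid file-name resolution: LFN -> GUID -> SURL -> TURL.

rmc = {"lfn:/grid/cms/run42.root": "guid-0a1b2c"}                  # LFN -> GUID
rls = {"guid-0a1b2c": ["srm://se.gridka.de/pnfs/cms/run42.root"]}  # GUID -> SURLs

def srm_get(surl):
    """Ask the storage element's SRM to stage the file and hand back a TURL (placeholder)."""
    return surl.replace("srm://", "gsiftp://")

lfn = "lfn:/grid/cms/run42.root"
guid = rmc[lfn]        # replication metadata catalog lookup
surl = rls[guid][0]    # replica location service picks a site replica
turl = srm_get(surl)   # SRM returns a transfer URL for the actual data movement
print(turl)
```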

25 In short
– Loosely coupled cluster: Ethernet
– Tightly coupled cluster: InfiniBand
– From 100 to 200 TB: locally attached storage, NFS and/or RFIO
– Above 200 TB: SAN, a cluster file system and RFIO
– HSM via dCache, with its grid SRM interface
– Tape backend: TSM / GSI solution?? or Vanderbilt Enstor
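A toy decision helper that encodes the rules of thumb on this slide; the thresholds and categories come from the bullets above, everything else (function name, exact boundary handling) is illustrative.

```python
# Encode the slide's rules of thumb for picking a storage layout.

def suggest_storage(total_tb, tightly_coupled=False):
    interconnect = "InfiniBand" if tightly_coupled else "Ethernet"
    if total_tb <= 200:
        layout = "locally attached storage, NFS and/or RFIO"
    else:
        layout = "SAN with a cluster file system and RFIO"
    return f"{interconnect}; {layout}"

print(suggest_storage(150))                        # loosely coupled tier 2 site
print(suggest_storage(400, tightly_coupled=True))  # larger, tightly coupled site
```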

26 Some encountered difficulties
Prescribed chain of software revision levels
– support is given only to those who live by the rules
– disk -> controller -> HBA -> driver -> kernel -> application
Linux limitations
– block addressability < 2^31 (see the sketch below)
– number of LUs < 128
NFS on Linux is a moving target
– enhancements or fixes almost always introduce new bugs
– limited experience with large (> 100 clients) installations
Storage units become difficult to handle
– exchanging 1 TB and rebalancing a live 5 TB file system takes 20 hours
– restoring a 5 TB file system can take up to a week
– acquisition needs 1 FTE / 10^6
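What the "block addressability < 2^31" limit above means in bytes, as a sketch assuming 512-byte sectors (the usual case for these kernels; the assumption is not stated on the slide).

```python
# Convert the 2^31 block-address limit into usable capacity per device.

SECTOR_BYTES = 512          # assumed sector size
max_blocks = 2**31          # limit quoted on the slide

max_bytes = max_blocks * SECTOR_BYTES
print(f"{max_bytes / 2**40:.0f} TiB addressable per block device")            # -> 1 TiB
print(f"with the 128-LU limit: {128 * max_bytes / 2**40:.0f} TiB per host")   # -> 128 TiB
```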

27 Thank you for your attention

