Lustre File System Evaluation at FNAL
CHEP'09, Prague, March 23, 2009
Stephen Wolbers for Alex Kulyavtsev, Matt Crawford, Stu Fuess, Don Holmgren, Dmitry Litvintsev, Alexander Moibenko, Stan Naymola, Gene Oleynik, Timur Perelmutov, Don Petravick, Vladimir Podstavkov, Ron Rechenmacher, Nirmal Seenu, Jim Simone
Fermilab

Outline
- Goals of Storage Evaluation Project and Introduction
- General Criteria
- HPC Specific Criteria
- HSM Related Criteria
- HEP Specific Criteria
- Test Suite and Results
- Lustre at FNAL HPC Facilities: Cosmology and LQCD
- Conclusions

Storage Evaluation
Fermilab's Computing Division regularly investigates global/high-performance file systems which meet criteria in:
- capacity, scalability and I/O performance
- data integrity, security and accessibility
- usability, maintainability, ability to troubleshoot & isolate problems
- tape integration
- namespace and its performance
Produced a list of weighted criteria a system would need to meet (HEP and HPC) – FNAL CD DocDB
Set up test stands to get experience with file systems and to perform measurements where possible

Storage Evaluation
Additional input for evaluation:
- File system documentation
- Design and performance of existing installations
- Communications with vendor/organization staff
- Training
The focus of this talk is our evaluation of Lustre:
- Most effort so far has concentrated on general functionality and HPC (Computational Cosmology Cluster and Lattice QCD)
- We are in the process of evaluating Lustre for HEP
Lustre installations:
- Preproduction and production systems on the Computational Cosmology Cluster
- LQCD cluster

Storage Evaluation Criteria
- Capacity of 5 PB, scalable by adding storage units
- Aggregate I/O > 5 GB/s today, scalable by adding I/O units
  (LLNL BlueGene/L has 200,000 client processes accessing a 2.5 PB Lustre file system through 1024 I/O nodes over a 1 Gbit Ethernet network at 35 GB/s)
- The disk subsystem should impose no limit on file sizes. Typical files used in HEP today are 1 GB to 50 GB.
Legend for the criteria:
- ✓ satisfies the criterion
- ✗ either does not satisfy, or only partially satisfies, the criterion
- ? not tested
- green – example exists; purple – coming soon; red – needs attention

Criteria: Functionality
- Storage capacity and aggregate data I/O bandwidth scale linearly by adding scalable storage units
- Storage runs on a general-purpose platform; Ethernet is the primary access medium for capacity computing
- ✗ Easy to add, remove, or replace a scalable storage unit. Can work on a mix of storage hardware. The system scales up when units are replaced by ones with more advanced technology.
  - addition and replacement are fine
  - removal still has issues

Criteria: Namespace
- Provides a hierarchical namespace mountable with POSIX access on thousands of nodes (approx. 25,000 nodes on Red Storm at SNL)
- Supports millions of online (and tape-resident) files (e.g., at LLNL)
- ? Client processes can open at least 100 files, and tens of thousands of files can be open for read in the system. We have not tested this requirement, but Lustre's metadata server does not limit the number of open files (a minimal sketch of such a check follows this slide).
- Supports hundreds of metadata ops/sec without affecting I/O. The measured metadata rate exceeds this requirement.
- ✗ It must be possible to make a backup or dump of the namespace and metadata without taking the system down. We perform hourly LVM snapshots, but we really need the equivalent of transaction logging.
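As a hedged illustration of the untested open-files criterion above, the Python sketch below simply holds a couple of hundred file descriptors open on a Lustre-mounted directory from a single client process. The mount point and file count are assumptions, not values from the evaluation.

```python
import os

# Hedged sketch of the untested "open files per client" criterion:
# hold a few hundred files open simultaneously on a Lustre mount.
# MOUNT and N_FILES are assumptions, not values from the evaluation.
MOUNT = "/mnt/lustre"
N_FILES = 200   # criterion asks for at least 100 per client process

def open_many(n=N_FILES, root=MOUNT):
    handles = []
    try:
        for i in range(n):
            path = os.path.join(root, f"opentest_{i}.dat")
            fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
            handles.append((fd, path))
        print(f"held {len(handles)} files open simultaneously")
    finally:
        # close and remove everything, even if an open failed partway through
        for fd, path in handles:
            os.close(fd)
            os.unlink(path)

if __name__ == "__main__":
    open_many()
```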

Criteria: Scaling Data Transfers and Recovery
- ✗ Must support at least 600 WAN and 6000 LAN transfers simultaneously, with a mixture of writes and reads (perhaps a 1 to 4 ratio). One set must not starve the other.
- ✗ Must be able to control the number of WAN and LAN transfers independently, and/or set limits for each transfer protocol.
- Must be able to limit striping across storage units to contain the impact of a total disk failure to a small percentage of files (see the sketch after this slide). Serving "hot files" to multiple clients may conflict with "less striping".
We are interested in some future features:
- file migration for space rebalancing
- a set of tools for Information Lifecycle Management
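Limiting striping of this kind is done per file or per directory with Lustre's standard lfs utility. The Python wrapper below is only a sketch; the path and stripe count are illustrative, not the values used at FNAL.

```python
import subprocess

def limit_striping(path, stripe_count=4):
    """Limit striping of new files under `path` to a few OSTs so that a
    total disk failure touches only a small fraction of files.
    Uses the standard Lustre `lfs setstripe -c` command."""
    subprocess.run(["lfs", "setstripe", "-c", str(stripe_count), path], check=True)

def show_striping(path):
    """Report the current stripe layout (stripe count, OST objects)."""
    subprocess.run(["lfs", "getstripe", path], check=True)

# Example usage (illustrative path):
# limit_striping("/mnt/lustre/experiment/data", stripe_count=4)
# show_striping("/mnt/lustre/experiment/data")
```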

Criteria: Data Integrity & Security
- ✗ Support for hashing or checksum algorithms. Adler32, CRC32 and more will be provided in Lustre v2.0. End-to-end data integrity will be provided by the ZFS DMU.
- The system must scan itself, or allow scanning, for silent file corruption without undue impact on performance (under investigation; a sketch of an external scan follows this slide).
- Security over the WAN is provided by WAN protocols such as GridFTP, SRM, etc. We require communication integrity rather than confidentiality. Kerberos support will be available in Lustre v2.0.
- ACLs, user/group quotas
- Space management (v2.x?)
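The checksum support mentioned above is a future Lustre feature; in the meantime a scan for silent corruption could be run externally. The Python sketch below shows the idea using zlib's Adler-32 against a hypothetical dictionary of previously recorded checksums; it is not Lustre's own mechanism.

```python
import os
import zlib

def adler32_of(path, blocksize=1 << 20):
    """Compute the Adler-32 checksum of a file in 1 MiB blocks."""
    value = 1  # zlib's Adler-32 starting value
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            value = zlib.adler32(block, value)
    return value & 0xFFFFFFFF

def scan_tree(root, known_checksums):
    """Walk `root` and report files whose current checksum differs from
    the recorded one (possible silent corruption)."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if path in known_checksums and adler32_of(path) != known_checksums[path]:
                print("checksum mismatch:", path)

# Example usage with a made-up catalog:
# scan_tree("/mnt/lustre/archive", {"/mnt/lustre/archive/run001.dat": 0x1A2B3C4D})
```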

HPC Specific Criteria and Lustre
Lustre on the Computational Cosmology and LQCD clusters:
- Large, transparently (without downtime) extensible, hierarchical file system accessible through standard Unix system calls
- Parallel file access: file system visible to all executables, with the possibility of parallel I/O
- Deadlock-free for MPI jobs
- POSIX I/O
- More stable than NFS (Computational Cosmology only)
- Ability to run on commodity hardware and a Linux OS to reuse existing hardware

HSM-Related Criteria
- ✗ Integration of the file system with the site's HSM (e.g., Enstore, CASTOR, HPSS) is required for use at large HEP installations
- The HSM shall provide transparent access to 10 to 100 PB of data on tape (growing in time)
- Must be able to create file stubs in Lustre for millions of files already existing in the HSM in a reasonable time
- Automatically migrates designated files to tape
- Transparent file restore on open()
- Pre-stage large file sets from the HSM to disk. Enqueue many read requests (current CMS FNAL T1 peak: 30,000) with O(100) active transfers
- Evict files already archived when disk space is needed

Lustre HSM Feature
- Lustre does not yet have an HSM feature. Some sites implement simple tape backup schemes.
- The HSM integration feature is under development by CEA and Sun:
  - HSM v1.0 – "Basic HSM" in a future release of Lustre (beta in fall 2009?)
  - Integration with HPSS (v1); others will follow
  - Metadata scans to select files to store in the HSM (v1)
  - File store on close() / on write in HSM v2
- Integration work: work specific to the HSM is required for integration

HEP and General Criteria and Lustre
- The GridFTP server from Globus Toolkit v4.0.7 worked out of the box (an illustrative transfer sketch follows this slide)
- A BeStMan SRM gateway server is installed on the LQCD cluster
- StoRM SRM performance on top of Lustre is reported at this conference
- Open source; training and commercial support are available
- Issues reported to the lustre-discuss list are quickly answered
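The slide reports only that the Globus GridFTP server worked on top of a Lustre mount out of the box. Purely as an illustration, a client-side transfer onto such a mount with the standard globus-url-copy tool might look like the sketch below; the hostname and paths are made up.

```python
import subprocess

# Illustrative only: hostname and paths are assumptions.  This pushes a
# local file through a GridFTP server whose backend storage is a Lustre
# mount, using the standard globus-url-copy client.
SRC = "file:///local/data/run001.dat"
DST = "gsiftp://gridftp.example.org/mnt/lustre/run001.dat"

subprocess.run(["globus-url-copy", SRC, DST], check=True)
```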

Lustre Tests at Fermilab
- Developed a test harness and test suite to evaluate systems against the criteria
- "Torture" tests to emulate large loads for large data sets
  - metadata stressor – create millions of files (a minimal sketch follows this slide)
- Used standard tests against a Lustre filesystem:
  - data I/O: IOZone, b_eff_io
  - metadata I/O: fileop/IOZone, metarates, mdtest
- A pilot test system was used to gain experience and validate system stability
- Initial throughputs were limited by the disks used for Lustre's data storage. Subsequent installations used high-performance disk arrays and achieved higher speeds.
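The actual FNAL test harness is not shown in the slides. The sketch below is a minimal Python metadata stressor in the spirit of the "create millions of files" test, timing create/stat/unlink rates; the mount point and file count are assumptions.

```python
import os
import time

# Minimal metadata stressor sketch: create, stat, and unlink many small
# files under a Lustre-mounted directory and report operations/second.
# MOUNT and N are assumptions, not the parameters of the FNAL harness.
MOUNT = "/mnt/lustre/stress"
N = 100_000

def stress(root=MOUNT, n=N):
    os.makedirs(root, exist_ok=True)
    start = time.time()
    for i in range(n):
        fd = os.open(os.path.join(root, f"f{i:08d}"), os.O_CREAT | os.O_WRONLY, 0o644)
        os.close(fd)
    created = time.time()
    for i in range(n):
        os.stat(os.path.join(root, f"f{i:08d}"))
    statted = time.time()
    for i in range(n):
        os.unlink(os.path.join(root, f"f{i:08d}"))
    done = time.time()
    print(f"create: {n / (created - start):.0f} ops/s")
    print(f"stat:   {n / (statted - created):.0f} ops/s")
    print(f"unlink: {n / (done - statted):.0f} ops/s")

if __name__ == "__main__":
    stress()
```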

Lustre Test System
- Three to five client nodes
- Two data servers
- Each node: dual quad-core Xeon CPUs, 16 GB memory, local 500 GB SATA disk

Lustre Metadata Rates
Single-client metadata rates measured with the fileop/IOZone benchmark (chart not reproduced in this transcript).

Lustre Metadata Rates
Aggregate metadata rate measured with the metarates benchmark for 1 to 128 clients running on 5 nodes × 8 cores (chart not reproduced in this transcript).

Lustre on the Computational Cosmology Cluster (system diagram)
- Lustre data: 2 SATABeasts = 72 TB RAID6; 2 × 12 LUNs = 2 × 3 volumes × 4 partitions
- 4 Lustre data servers, one shared with the metadata server (250 GB on LVM)
- 125 compute nodes on 1 Gb Ethernet (SMS 8848M stackable switch)

Lustre on FNAL LQCD Clusters (system diagram)
- 1 metadata server and 4 file servers; 72 TB of data on RAID6, two 4 Gbit FC links per SATABeast; 750 GB RAID1
- Lustre served over the InfiniBand network (IPoIB); 10 GigE between the LCC and GCC buildings
- Volatile dCache: 72 TB

Lustre Experience – HPC
From our experience in production on the Computational Cosmology Cluster (starting summer 2008) and limited pre-production on the LQCD JPsi cluster (December 2008), the Lustre file system:
- doesn't suffer the MPI deadlocks seen with dCache
- eliminates, via direct POSIX I/O access, the staging of files to/from worker nodes that was needed with dCache
- improved I/O rates compared to NFS and eliminated periodic NFS server "freezes"
- reduced administration effort

Conclusions – HEP
- The Lustre file system meets or exceeds our storage evaluation criteria in most areas, such as system capacity, scalability, I/O performance, functionality, stability and high availability, accessibility, maintenance, and WAN access.
- Lustre has much faster metadata performance than our current storage system.
- At present Lustre can only be used for HEP applications not requiring large-scale tape I/O, such as LHC T2/T3 centers or scratch/volatile disk space at T1 centers.
- Lustre's near-term roadmap (about one year) for HSM in principle satisfies our HSM criteria. Some work will still be needed to integrate any existing tape system.

Backup Slides

Lustre Jargon
- Client – the node where a user application runs. It talks to the metadata server (MDS) and the data servers (OSSs).
- MDS – MetaData Server – the node serving metadata; one active per system
- MDT – MetaData Target – disk storage for metadata, connected to the MDS
- OSS – Object Storage Server – the node serving data files or file stripes
- OST – Object Storage Target – disk storage for data files or file stripes, connected to an OSS

What is Lustre? (architecture diagram)
Client nodes (workers), a metadata server, and data servers in front of the Lustre data storage.