03/03/09 USCMS T2 Workshop, Slide 1: Future of storage: Lustre
Dimitri Bourilkov, Yu Fu, Bockjoo Kim, Craig Prescott, Jorge L. Rodriguez, Yujun Wu

03/03/09 USCMS T2 Workshop, Slide 2: Outline
- Introduction
- Lustre features
- Lustre architecture and setup
- Preliminary experience in using Lustre
- FIU to UF-HPC tests
- Questions and outlook
- Summary

03/03/09 USCMS T2 Workshop, Slide 3: Introduction
- The Lustre filesystem is a multiple-network, scalable, open-source cluster filesystem.
- Lustre components:
  - MDS (Metadata Server): manages the names and directories in the filesystem, not the "real data";
  - OSS (Object Storage Server): contains OSTs (Object Storage Targets) and does the real work of storing, receiving, and sending data;
  - Lustre clients.

03/03/09 USCMS T2 Workshop, Slide 4: Lustre features
- Lustre achieves high I/O performance by distributing data objects across OSTs and allowing clients to interact directly with the OSSs (a small striping example follows below).
- [Diagram: the Lustre client sends a file open request to the metadata server (MDT) and receives the file metadata, an inode listing the objects (obj1, obj2, ...); the client then reads and writes those objects directly on the OSTs (OST1, OST2, OST3) of the object storage servers.]
- This is similar to the inode concept, with a list of blocks for file data on a disk.
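As a hedged aside (not on the slide), the striping the diagram describes can be inspected and controlled from any client with the standard lfs utility; the directory path and stripe parameters below are hypothetical, and older Lustre releases spell the stripe-size flag -s rather than -S.

  # Hedged sketch: inspect and set Lustre striping with the standard `lfs` tool.
  # The directory and stripe settings are illustrative placeholders.
  import subprocess

  data_dir = "/lustre/cms/store/example"   # hypothetical Lustre directory

  # Show free space per OST, i.e. the object storage targets files stripe over.
  subprocess.run(["lfs", "df", "-h", data_dir], check=True)

  # Show how files in this directory are currently striped across OSTs.
  subprocess.run(["lfs", "getstripe", data_dir], check=True)

  # Stripe new files over 4 OSTs with a 1 MB stripe size (older Lustre: -s 1m).
  subprocess.run(["lfs", "setstripe", "-c", "4", "-S", "1m", data_dir], check=True)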

03/03/09 USCMS T2 Workshop, Slide 5: Lustre features (2)
- Lustre is a POSIX (Portable Operating System Interface) compliant, general-purpose filesystem.
- Aggregate I/O bandwidth scales with the number of OSSs.
- Storage capacity is the total of the OSTs and can grow/shrink online.
- Automatic failover of the MDS, automatic OST balancing.
- Single, coherent, synchronized namespace.
- Supports user quotas.
- Security: supports Access Control Lists (ACLs); Kerberos support is being developed.
  - Not so good: clients and servers need an identical understanding of UIDs and GIDs.
- Good WAN access performance.
- Simultaneously supports multiple network types (TCP, InfiniBand, Myricom MX, Elan, ...).

03/03/09 USCMS T2 Workshop, Slide 6: Lustre features (3)
Some technical numbers (from a Sun whitepaper):
- MDS: 3,000-15,000 ops/s.
- OSS: ~1,000 OSSs, with multiple OSTs on each OSS; maximum OST size is 8 TB each.
- Scalability with size on a single system:
  - in production: 1.9 PB;
  - deployed: 5 PB;
  - tested: 32 PB (with 4,000 OSTs).
- Client nodes: 25,000 nodes for a single production filesystem.
- Aggregate I/O rate can increase linearly with the number of OSSs; the best I/O rate seen is >130 GB/s (the maximum seen at UF is 2 GB/s).

03/03/09 USCMS T2 Workshop, Slide 7: Lustre architecture and setup

03/03/09 USCMS T2 Workshop, Slide 8: Lustre architecture and setup (2)
Typical setup:
- MDS: 1-2 servers with good CPU and RAM, and a high seek rate;
- OSS: servers with good bus bandwidth and storage.
Installation itself is simple (see the sketch below):
- Install the Lustre kernel and RPMs (download or build them yourself);
- Set up the Lustre modules and modify the /etc/modprobe.conf file;
- Format and mount the OST and MDT filesystems;
- Start the client with mount, similar to an NFS mount (the client can use the patchless client without modifying the kernel).
Notes:
- You can play with all the services (MDS, OSS) on a single node;
- Give yourself some time to learn and get familiar with it: 3 months(?);
- Once it is up, the manpower needed is small.
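A minimal, hedged sketch of those formatting and mounting steps, assuming Lustre 1.x-era tools; the device names, filesystem name and MGS node below are hypothetical, and the exact flags vary between Lustre releases.

  # Hedged sketch of the setup steps on the slide; all names are placeholders.
  import subprocess

  def run(cmd):
      print("+", " ".join(cmd))
      subprocess.run(cmd, check=True)

  # On the MDS node: format the combined MGS/MDT device and mount it.
  run(["mkfs.lustre", "--fsname=cmsfs", "--mgs", "--mdt", "/dev/sdb"])
  run(["mount", "-t", "lustre", "/dev/sdb", "/mnt/mdt"])

  # On each OSS node: format an OST, pointing it at the MGS node, and mount it.
  run(["mkfs.lustre", "--fsname=cmsfs", "--ost", "--mgsnode=mds01@tcp0", "/dev/sdc"])
  run(["mount", "-t", "lustre", "/dev/sdc", "/mnt/ost0"])

  # On a client: mount the whole filesystem, much like an NFS mount.
  run(["mount", "-t", "lustre", "mds01@tcp0:/cmsfs", "/lustre/cmsfs"])

In practice the MDS, OSS and client commands run on different machines; they are shown together here only to mirror the bullet list above.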

03/03/09 USCMS T2 Workshop, Slide 9: Preliminary experience in using Lustre
UF, FIU and FIT have been testing Lustre with CMS storage and analysis jobs since last year, with a lot of help from UF HPC. We have basically tried a few things:
- Using Lustre as dCache storage;
- Data access performance: testing the data access performance of CMS analysis jobs with data stored on the Lustre filesystem and comparing it with the performance using dcap;
- Testing remote access performance from FIU and FIT.

03/03/09 USCMS T2 Workshop, Slide 10: Preliminary experience in using Lustre (2)
- For dCache storage, we have tried using the Lustre filesystem both as tertiary storage (like tape) and directly as a dCache pool;
- The transfer rate was able to reach over 130 MB/s from a single Lustre-backed pool node.

03/03/09 USCMS T2 Workshop, Slide 11: Preliminary experience in using Lustre (3)
For CMS data access, files in Lustre can be integrated with CMS applications seamlessly, without any modification (see the configuration sketch below):
- Once the Lustre filesystem is mounted, it behaves just as if your jobs were accessing data on local disk;
- For I/O-intensive jobs, execution can be up to 2.6 times faster when files are accessed directly through the Lustre-mounted filesystem than when files of the same dataset are accessed with the dcap protocol from a dCache RAID pool with an xfs filesystem (on similar hardware);
- The execution time improves even with the dcap protocol when the files are placed on Lustre-backed pools;
- "Recent Storage Group report at HEPiX quotes a factor of 2 better performance for jobs running on Lustre." --- Alex Kulyavtsev, FNAL.
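As a hedged illustration of that seamlessness (not part of the original slides), here is a minimal CMSSW source configuration; the file paths and the dcap door are hypothetical placeholders. With the Lustre filesystem mounted, the job opens a plain file: path, while the dCache route goes through a dcap:// URL.

  # Minimal sketch of a CMSSW PoolSource configuration; the paths and the dcap
  # door below are hypothetical, not the actual UF/FIU locations.
  import FWCore.ParameterSet.Config as cms

  process = cms.Process("ANALYSIS")

  process.source = cms.Source("PoolSource",
      fileNames = cms.untracked.vstring(
          # Direct POSIX access through the mounted Lustre filesystem:
          "file:/lustre/cms/store/data/ExampleDataset/file1.root",
          # The same kind of file read from dCache via the dcap protocol:
          # "dcap://dcache-door.example.edu:22125/pnfs/example.edu/cms/file1.root",
      )
  )

  process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(1000))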

03/03/09 USCMS T2 Workshop, Slide 12: Preliminary experience in using Lustre (4)
[Plots: execution time comparison between direct Lustre access and dcap access with an xfs RAID pool; execution time with dcap access using a dCache pool on the Lustre filesystem.]

03/03/09 USCMS T2 Workshop, Slide 13: Preliminary experience in using Lustre (5)
Bockjoo did some further detailed comparison tests of CMSSW jobs using Lustre and dcap on striped files in the dCache Lustre pool:
- One can see that the major difference between Lustre and dcap reads comes from the analysis time and from the time between the file open request and the first data record read.

03/03/09 USCMS T2 Workshop, Slide 14: Preliminary experience in using Lustre (6)
- Remotely, FIU (Jorge) has been able to run CMS applications against a directly mounted Lustre filesystem, for data stored on the UF HPC Lustre, without any noticeable performance degradation;
- UF and FIT have been testing Lustre performance between our two sites, and the performance has been limited only by our network connection. They are now able to access the CMS data stored at UF;
  - A good example of collaboration between a T2 and a T3 to share data and resources.

03/03/09 USCMS T2 Workshop, Slide 15: FIU to UF-HPC tests - Configuration at FIU
Systems used:
- Analysis server: medianoche.hep.fiu.edu
  - Dual 4-core with 16 GB RAM and dual 1 GigE NICs (private/public);
  - Filesystems: (local), (NFS, 16 TB 3ware RAID), (Lustre-mounted 81 TB UFL-HPC CRN storage).
- "Gatekeeper": dgt.hep.fiu.edu
  - Dual 2-core with 2 GB RAM, single private 1 GigE NIC (an ageing '03 Xeon system);
  - Also used in experiments mounting Lustre over NAT for nodes on the private subnet.
System configuration:
- RHEL 4 (medianoche) and SL 5.0 (dgt);
- Booted the patched EL_lustre kernel on both systems;
- Modified modprobe.conf;
- Mounted UF-HPC's CRN on /crn/scratch.
Networking:
- Connection to FLR via a "dedicated link thru our CRN" border router;
- Hardware issues limit bandwidth to ~600 Mbps from FIU's CRN to FLR; the RTT is 16 ms;
- The servers' TCP buffer sizes are set to a maximum of 16 MB (see the sketch below).
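A hedged sketch of the usual way such buffer limits are raised on Linux, through the kernel sysctl files; the slide only gives the 16 MB maximum, so the remaining numbers below are assumptions, not the exact FIU settings.

  # Hedged sketch: raise TCP socket buffer limits to 16 MB for a long, fat WAN
  # path (16 ms RTT). Values other than the 16 MB cap are assumptions.
  MAX_BUF = 16 * 1024 * 1024  # 16 MB

  settings = {
      "/proc/sys/net/core/rmem_max": str(MAX_BUF),
      "/proc/sys/net/core/wmem_max": str(MAX_BUF),
      # min / default / max for TCP autotuning:
      "/proc/sys/net/ipv4/tcp_rmem": "4096 87380 %d" % MAX_BUF,
      "/proc/sys/net/ipv4/tcp_wmem": "4096 65536 %d" % MAX_BUF,
  }

  for path, value in settings.items():
      with open(path, "w") as f:   # requires root
          f.write(value)
      print("%s = %s" % (path, value))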

03/03/09 USCMS T2 Workshop, Slide 16: FIU to UF-HPC tests (2) - Configuration at UF-HPC
Networking:
- Connection to FLR via a "dedicated link thru our CRN" over 2x 10 Gbps links.
RAID Inc. Falcon III storage (104 TB raw, 81 TB volume):
- Six shelves, each with 24 SATA-II drives behind dual 4 Gbps FC RAID;
- Fronted by two dual 4-core servers with three dual-port FC cards, 16 GB RAM, InfiniBand and a 10 Gbps Chelsio NIC;
- Mounted at FIU via Lustre over TCP/IP;
- The system can read/write natively at well over 1 GB/s via TCP/IP.
Site synchronization and security:
- The HPC Lustre systems are configured to allow mounts from particular remote servers at FIU and FIT;
  - Access is granted to specific IPs; other nodes on the site can do a remote Lustre mount via NAT (known to work but not tested yet).
- The remote servers and the HPC Lustre share a common UID/GID domain;
  - Not a problem for a Tier3 with dedicated servers and a small user community.
- ACLs, root_squash, etc. are available in this version of Lustre but have not yet been explored.

03/03/09 USCMS T2 Workshop, Slide 17: FIU to UF-HPC tests (3) - IOzone test results
Ran tests on both medianoche & dgt (a hedged example invocation follows below):
- File size set to 2x RAM;
- Checked with various record lengths;
- Checked with IOzone multi-process mode, up to 8 processes running simultaneously.
Results are robust (haven't prepared plots yet...):
- Results are consistent with dd read/write;
- CMSSW was tested, but the I/O rates for a single job are a fraction of the IOzone rates.
Conclusions:
- Either hardware configuration, with single or multiple streams, can completely saturate the campus network link to FLR via Lustre!
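A hedged sketch of the kind of IOzone run described above, driven from Python for consistency with the other examples; the mount point, file size and record size are placeholders rather than the exact FIU parameters.

  # Hedged sketch of an IOzone throughput run like the one on the slide.
  # 32 GB ~ 2x the 16 GB RAM of medianoche; path and sizes are placeholders.
  import subprocess

  cmd = [
      "iozone",
      "-i", "0",            # test 0: sequential write/rewrite
      "-i", "1",            # test 1: sequential read/reread
      "-s", "32g",          # per-process file size (~2x RAM)
      "-r", "1m",           # record (transfer) size
      "-t", "8",            # throughput mode with 8 parallel processes
      "-F",                 # one file per process, on the Lustre mount:
  ] + ["/crn/scratch/iozone.%d" % i for i in range(8)]

  subprocess.run(cmd, check=True)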

03/03/09 USCMS T2 Workshop, Slide 18: FIU to UF-HPC use cases
Use case scenario 1: provide distributed read/write scratch space
- A convenient way of collaborating;
- Both Tier2 and Tier3 analysis servers have access to the same scratch space;
- This is how we are currently using the FIU/UF-HPC space.
Use case scenario 2: provide access to data storage via Lustre
- Read-only access for a dedicated remote (Tier3) server;
- Eliminates the need to deploy CMS data management services at Tier3s:
  - utilize resources, including the data managers, at the Tier2;
  - no need to deploy hardware and services (PhEDEx, SRM-enabled storage, etc.);
- Tier3 compute resources could be used to access CMS data directly, via a NAT-enabled gateway or switch, or by exporting the remote Lustre mount via NFS to the rest of the Tier3 nodes;
- This also allows interactive remote access to CMS data from a Tier3 (sometimes the only way to find the answer to a problem!);
- This has not yet been fully explored.

03/03/09 USCMS T2 Workshop, Slide 19: Questions and outlook
- Q: Can we use Lustre to simplify our dCache storage architecture, e.g. by using Lustre as dCache pool storage?
- Q: What about putting an SRM interface directly on the Lustre filesystem?
- Q: Can we use Lustre as the primary means of CMS data sharing and distribution between T2 and T3?
  - Advantage: lower manpower/resource cost --- only a fraction of the data is accessed by CMS jobs.
- Q: Use Lustre as the home/data directory?
- Q: How does Lustre compare with pNFS (NFS 4.1)?

03/03/09 USCMS T2 Workshop, Slide 20: Summary
- Lustre has been shown to have good performance and scalability, and it is relatively easy to deploy and administer;
- CMS user analysis jobs have been able to run on data stored on the Lustre filesystem without any problems, and performance improves significantly when the data are accessed through Lustre rather than directly through dCache;
- CMS T3 physicists have been able to share CMS data located remotely at the UF T2 site;
- UF has been able to integrate Lustre with the existing dCache infrastructure; there is potential to simplify our storage architecture using Lustre.
- There is really a lot of potential in Lustre!