1/20 Investigation of Storage Systems for use in Grid Applications
ISGC 2012, Feb 27, 2012
Keith Chadwick for Gabriele Garzoglio
Grid & Cloud Computing Dept., Computing Sector, Fermilab

Overview
- Test bed and testing methodology
- Tests of general performance
- Tests of the metadata servers
- Tests of root-based applications
- HEPiX storage group result
- Operations and fault tolerance

2/20 Acknowledgements
- Ted Hesselroth, Doug Strain – IOZone perf. measurements
- Andrei Maslennikov – HEPiX storage group
- Andrew Norman, Denis Perevalov – Nova framework for the storage benchmarks and HEPiX work
- Robert Hatcher, Art Kreymer – Minos framework for the storage benchmarks and HEPiX work
- Steve Timm, Neha Sharma, Hyunwoo Kim – FermiCloud support
- Alex Kulyavtsev, Amitoj Singh – Consulting
This work is supported by the U.S. Department of Energy under contract No. DE-AC02-07CH11359.

3/20 Context
Goal
- Evaluation of storage technologies for the use case of data-intensive jobs on Grid and Cloud facilities at Fermilab.
- The study supports and informs the procurement and technology decisions of the computing facility at Fermilab.
- The results of the study support the growing deployment of Lustre at the Lab, while maintaining our BlueArc infrastructure.
Technologies considered
- Lustre (v1.8.3); Hadoop Distributed File System (HDFS); BlueArc (BA); OrangeFS (PVFS2 v2.8.4)
Targeted infrastructures
- FermiGrid, FermiCloud, and the General Physics Computing Farm.
Collaboration at Fermilab
- FermiGrid / FermiCloud, Open Science Grid Storage team, Data Movement and Storage, running experiments

4/20 Evaluation Method
- Set the scale: measure storage metrics from running experiments to set the scale on expected bandwidth, typical file size, number of clients, etc.
- Install the storage solutions on the FermiCloud test bed.
- Measure performance:
  - run standard benchmarks on the storage installations
  - study the response of each technology to real-life (skim) application access patterns (root-based)
  - use the HEPiX storage group infrastructure to characterize the response to Intensity Frontier (IF) applications
- Fault tolerance: simulate faults and study the reactions.
- Operations: comment on potential operational issues.

5/20 Test Bed: "Bare Metal" Client / Server
[Diagram: FCL Lustre server (3 data + 1 name node, 2 TB / 6 disks, eth; Dom0: 8 CPU, 24 GB RAM) accessed by ITB / FCL external clients (7 nodes - 21 VM) with a BA mount]
Test: ITB clients vs. Lustre "Bare Metal"
Server: Lustre on SL, 3 OSS (different striping)
- CPU: dual quad-core Xeon 2.67 GHz with 12 MB cache, 24 GB RAM
- Disk: 6 SATA disks in RAID 5 for 2 TB, plus 2 system disks (hdparm: … MB/sec)
- 1 GbE + IB cards
Clients (ITB): dual quad-core Xeon 2.66 GHz with 4 MB cache, 16 GB RAM; 3 Xen SL5 VMs per machine, 2 cores / 2 GB RAM each.

6/20 Test Bed: On-Board vs. External Clients
[Diagram: FCL server (3 data + 1 name node, 2 TB / 6 disks, eth; Dom0: 8 CPU, 24 GB RAM) hosting on-board client VMs and 7 x optional storage server VMs; ITB / FCL external clients (7 nodes - 21 VM) with a BA mount]
Tests:
- ITB clients vs. Lustre virtual server
- FCL clients vs. Lustre virtual server
- FCL + ITB clients vs. Lustre virtual server
Clients: 8 KVM VMs per machine; 1 core / 2 GB RAM each.

7/20 Data Access Tests
IOZone
- Writes a 2 GB file from each client and performs read/write tests.
- Setup: 3-60 clients on Bare Metal (BM) and 3-21 VM/nodes.
Root-based applications
- Nova: ana framework, simulating a skim app – reads a large fraction of all events, then disregards them all (read-only).
- Minos: loon framework, simulating a skim app – data is compressed, so access is CPU-bound (does NOT stress storage): NOT SHOWN.
- Writes are CPU-bound: NOT SHOWN.
- Setup: 3-60 clients on Bare Metal and 3-21 VM/nodes.
MDTest
- Different FS operations on up to 50k files / dirs using different access patterns.
- Setup: clients on 21 VM.
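
For illustration, a minimal Python sketch of how such a per-client benchmark launch could be scripted. The host names, mount point, and the exact iozone/mdtest invocations are assumptions, not the commands actually used in this study.

    #!/usr/bin/env python
    # Hypothetical driver for the per-client IOZone / mdtest runs described above.
    # Host names, mount point, and flags are illustrative assumptions.
    import subprocess

    CLIENTS = ["fcl-vm%02d" % i for i in range(1, 22)]   # 21 client VMs (assumed names)
    MOUNT = "/mnt/teststore"                             # storage under test (assumed path)

    def run_iozone(host):
        # Write a 2 GB file from this client, then re-read it
        # (-i 0: write/rewrite, -i 1: read/reread, -s: file size, -r: record size).
        cmd = "iozone -i 0 -i 1 -s 2g -r 1m -f %s/iozone.%s" % (MOUNT, host)
        return subprocess.Popen(["ssh", host, cmd])

    def run_mdtest(host):
        # Metadata operations on up to 50k files/dirs from this client
        # (-n: items per process, -d: working directory).
        cmd = "mdtest -n 50000 -d %s/mdtest.%s" % (MOUNT, host)
        return subprocess.Popen(["ssh", host, cmd])

    if __name__ == "__main__":
        procs = [run_iozone(h) for h in CLIENTS]   # launch all clients concurrently
        for p in procs:
            p.wait()                               # wait for every client to finish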

8/20 Lustre Performance (iozone)
How well does Lustre perform with servers on Bare Metal vs. VM?
[Plots: read and write bandwidth; Lustre server on Bare Metal vs. Lustre server on VM (SL5 and KVM)]
- Read: same for Bare Metal and VM server (with virtio net driver).
- Read: on-board clients 15% slower than external clients (not significant).
- Write: Bare Metal server 3x faster than VM server.
- Striping has a 5% effect on reading, none on writing.
- No effect from changing the number of cores on the server VM.
Note: SL6 may have better write performance.

9/20 BlueArc Performance (iozone)
How well do VM clients perform vs. Bare Metal clients?
[Plots: read and write bandwidth (MB/s) vs. number of clients]
- Bare Metal reads are ~10% faster than VM reads.
- Bare Metal writes are ~5% faster than VM writes.
- Note: results vary depending on the overall system conditions (net, storage, etc.)
Do we need to optimize the transmit eth buffer size (txqueuelen) for the client VMs (writes)?

  Eth interface      txqueuelen
  Host               1000
  Host / VM bridge   500, 1000, 2000
  VM                 1000

- WRITE: for these "reasonable" values of txqueuelen, we do NOT see any effect.
- Varying the eth buffer size for the client VMs does not change the read / write BW.
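
As an illustration of the txqueuelen check discussed above, a minimal sketch; the interface names (eth0, br0) are assumptions, and setting the value requires root.

    # Hypothetical helper to inspect / set the transmit queue length (txqueuelen)
    # on the host NIC, the VM bridge, and the guest; interface names are assumptions.
    import re
    import subprocess

    def get_txqueuelen(iface):
        # Parse 'ip link show <iface>' output for the qlen field.
        out = subprocess.check_output(["ip", "link", "show", iface]).decode()
        match = re.search(r"qlen (\d+)", out)
        return int(match.group(1)) if match else None

    def set_txqueuelen(iface, qlen):
        # Requires root; equivalent to 'ip link set <iface> txqueuelen <qlen>'.
        subprocess.check_call(["ip", "link", "set", iface, "txqueuelen", str(qlen)])

    if __name__ == "__main__":
        for iface in ("eth0", "br0"):          # host NIC and VM bridge (assumed names)
            print(iface, get_txqueuelen(iface))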

10/20 Hadoop Performance (iozone)
How well do VM clients perform vs. Bare Metal clients? Is there a difference for external vs. on-board clients? How does the number of replicas change performance?
Read
- On-board Bare Metal client reads gain from kernel caching for a few clients. For many clients, they are the same or ~50% faster than on-board VM clients and ~100% faster than external VM clients.
- On-board VM clients are 50%-150% faster than external VM clients.
- Multiple replicas have little effect on read BW.
Write
- On-board Bare Metal client writes gain from kernel caching: generally faster than VM clients.
- On-board VM clients are 50%-200% faster than external VM clients.
- All VM writes scale well with the number of clients.
- For external VM clients, write speed scales ~linearly with the number of replicas.
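
For context on the replica study, a hedged sketch of how the replication factor of the test data could be varied between runs; the HDFS path and the use of 'hadoop fs -setrep' here are assumptions about the setup, not the study's documented procedure.

    # Hypothetical snippet: vary the HDFS replication factor of the test files
    # between runs (1, 2, 3 replicas). The path is an assumption.
    import subprocess

    TEST_DIR = "/benchmarks/iozone"   # HDFS directory holding the test files (assumed)

    def set_replication(path, replicas):
        # Ask HDFS to (re)replicate everything under 'path' to 'replicas' copies.
        # -R recurses, -w waits until the target replication is reached.
        subprocess.check_call(
            ["hadoop", "fs", "-setrep", "-R", "-w", str(replicas), path])

    for n in (1, 2, 3):
        set_replication(TEST_DIR, n)
        # ... re-run the read benchmark against TEST_DIR here ...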

11/20 OrangeFS Performance (iozone)
How well do VM clients perform vs. Bare Metal clients? Is there a difference for external vs. on-board clients? How does the number of name nodes change performance?
Read
- On-board Bare Metal clients read almost as fast as on-board VMs (faster config.), but 50% slower than VM clients for many processes: possibly too many processes for the OS to manage.
- On-board VM clients read 10%-60% faster than external VM clients.
- Using 4 name nodes improves read performance by 10%-60% compared to 1 name node (different from write performance). Best performance when each name node serves a fraction of the clients.
Write
- On-board Bare Metal client writes are 80% slower than VM clients: possibly too many processes for the OS to manage. Write performance is NOT consistent.
- On-board VM clients generally have the same performance as external VM clients.
- One reproducible measurement shows 70% slower writes for external VMs when using 4 name nodes, each serving a fraction of the clients. Different from read performance.

12/20 MetaData Srv Performance Comparison
The Hadoop Name Node scales better than OrangeFS.
Three metadata-testing access patterns were used:
- Unique: each client performs metadata operations on a directory and files unique to that client, all at the same time.
- Shared: one client creates a set of test files, then all clients perform operations on those shared files at the same time.
- Single: all clients perform operations on files unique to that client, but all files are located in a single shared directory.
[Plot: metadata rates comparison for file / dir stat (ops/sec) vs. number of clients; mdtest on 21 VMs (7 VM x 3 hosts), "Single" access pattern; curves for Hadoop (1 replica), OrangeFS (4 nn / 3 dn), and Lustre on Bare Metal]
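
A schematic Python sketch of what the three patterns mean in terms of file layout; mdtest itself implements these patterns, so the code below is only an illustrative model, with all paths and counts assumed.

    # Illustrative model of the three mdtest access patterns described above.
    # This is not mdtest; directory names and counts are assumptions.
    import os

    ROOT = "/mnt/teststore/mdtest"   # assumed test area
    N_FILES = 1000                   # files per client in this toy example

    def unique_pattern(client_id):
        # 'Unique': each client stats files in its own private directory.
        d = os.path.join(ROOT, "client%02d" % client_id)
        return [os.path.join(d, "f%06d" % i) for i in range(N_FILES)]

    def shared_pattern():
        # 'Shared': every client stats the same files, created by one client.
        d = os.path.join(ROOT, "shared")
        return [os.path.join(d, "f%06d" % i) for i in range(N_FILES)]

    def single_dir_pattern(client_id):
        # 'Single': per-client files, but all located in one shared directory.
        d = os.path.join(ROOT, "single")
        return [os.path.join(d, "c%02d_f%06d" % (client_id, i)) for i in range(N_FILES)]

    def stat_all(paths):
        # The measured metadata operation: stat() each path, ignoring missing files.
        for p in paths:
            try:
                os.stat(p)
            except OSError:
                pass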

13/20 Lustre Perf. – Root-based Benchmark (21 clients)
Read performance: how does the Lustre server on VM compare with the Lustre server on Bare Metal for external and on-board clients?
- Non-striped Bare Metal (BM) server, baseline for read: Ext. clients vs. BM server, BW = … ± 0.06 MB/s (1 client vs. BM: 15.6 ± 0.2 MB/s).
- Ext. clients vs. virtual server: BW = … ± 0.08 MB/s (1 ITB client: 15.3 ± 0.1 MB/s). The virtual server is almost as fast as Bare Metal for read; virtual clients are as fast as BM for read.
- On-board clients vs. virtual server: BW = … ± 0.05 MB/s (1 FCL client: 14.4 ± 0.1 MB/s). On-board clients are 6% faster than Ext. clients (opposite of Hadoop & OrangeFS).
Does striping affect read BW? (4 MB stripes on 3 OSTs)
- Ext. clients vs. virtual server: BW = … ± 0.01 MB/s; on-board clients vs. virtual server: BW = … ± 0.03 MB/s.
- Data striping gives a more "consistent" BW.
49 clients saturate the BW: is the distribution fair? Clients do NOT all get the same share of the bandwidth (within 20%).
[Plot: read bandwidth per client, Ext. vs. on-board clients]
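
For reference, a hedged sketch of how a striped test directory could be set up from a Lustre client; the mount point is an assumption, and 'lfs setstripe' option names differ slightly between Lustre releases (e.g. -s vs. -S for the stripe size), so this is not necessarily the command used in the study.

    # Hypothetical snippet: configure 4 MB stripes over 3 OSTs for a test directory,
    # then verify the layout. Mount point is an assumption; 'lfs' option names
    # vary across Lustre versions.
    import os
    import subprocess

    TEST_DIR = "/mnt/lustre/striping_test"   # assumed Lustre-mounted directory
    os.makedirs(TEST_DIR, exist_ok=True)

    # 4 MB stripe size (-S), stripe count 3 (-c): new files inherit this layout.
    subprocess.check_call(["lfs", "setstripe", "-S", "4M", "-c", "3", TEST_DIR])
    # Show the resulting layout for confirmation.
    print(subprocess.check_output(["lfs", "getstripe", TEST_DIR]).decode())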

14/20 BlueArc Perf. – Root-based Benchmark
How well do VM clients perform vs. Bare Metal (BM) clients?
- Read BW is essentially the same on Bare Metal and VM.
- Root-app read rates, 21 clients: 8.15 ± 0.03 MB/s (Lustre: … ± 0.06 MB/s; Hadoop: ~7.9 ± 0.1 MB/s).
- Note: the NOvA skimming app reads 50% of the events by design. On BA, OrangeFS, and Hadoop, clients transfer 50% of the file; on Lustre 85%, because the default read-ahead configuration is inadequate for this use case.
[Plots: per-client read rates for 21 BM clients, 21 VM clients, and 49 VM clients; 3 slow clients at 3 MB/s; outliers consistently slow]
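
As an aside on the read-ahead remark above, a hedged sketch of where the client-side read-ahead tunables live on a Lustre 1.8-era client; the proc paths vary by version, and this was not necessarily the tuning applied in the study.

    # Hypothetical peek at the Lustre client read-ahead tunables (llite layer).
    # Proc paths differ between Lustre versions; this is illustrative only.
    import glob

    for path in glob.glob("/proc/fs/lustre/llite/*/max_read_ahead_mb"):
        with open(path) as f:
            print(path, "=", f.read().strip())
    # Writing a smaller value into max_read_ahead_mb (as root) would reduce the
    # over-reading seen with the sparse root skim access pattern.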

15/20 Hadoop Perf. – Root-based Benchmark
How does read BW vary for on-board vs. external clients? How does read BW vary vs. the number of replicas?
- Root-app read rates: ~7.9 ± 0.1 MB/s (Lustre on Bare Metal was … ± 0.06 MB/s read).
- External (ITB) clients read ~5% faster than on-board (FCL) clients.
- The number of replicas has minimal impact on read bandwidth.
49 clients (1 process / VM / core) saturate the BW to the server. Is the distribution of the BW fair?
- External and on-board clients get the same share of the BW among themselves (within ~2%).
- At saturation, external clients read ~10% faster than on-board clients (same as OrangeFS, different from Lustre).
- 10 files / 15 GB: min read time 1117 s; read times up to 233% of the min. time (113% without outliers).
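
A small sketch of the fairness bookkeeping quoted above (each client's read time as a percentage of the fastest client's time); the input numbers are invented for illustration.

    # Toy fairness check: express each client's read time relative to the fastest
    # client, as in "read time up to 233% of min. time". Times below are made up.
    read_times = {"vm01": 1117.0, "vm02": 1190.0, "vm03": 2600.0}   # seconds (invented)

    fastest = min(read_times.values())
    for client, t in sorted(read_times.items()):
        print("%s: %.0f s (%.0f%% of min)" % (client, t, 100.0 * t / fastest))

    spread = 100.0 * max(read_times.values()) / fastest
    print("worst client at %.0f%% of the minimum read time" % spread)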

16/20 OrangeFS Perf. – Root-based Benchmark
How does read BW vary for on-board vs. external clients?
- External (ITB) clients read ~7% faster than on-board (FCL) clients (same as Hadoop, opposite of the Lustre virtual server).
49 clients (1 process / VM / core) saturate the BW to the server. Is the distribution of the BW fair?
- External clients get the same share of the BW among themselves (within ~2%), as with Hadoop. On-board clients have a larger spread (~20%), as with the Lustre virtual server.
- At saturation, external clients on average read ~10% faster than on-board clients (same as Hadoop, different from the Lustre virtual server).
- 10 files / 15 GB: min read time 948 s; read times up to 125% of the min. time.

17/20 HEPiX Storage Group
Collaboration with Andrei Maslennikov*
- The Nova offline skim app is used to characterize storage solutions. Clients read and write events to storage (bi-directional test).
- Run 20, 40, 60, 80 jobs per 10-node cluster (2, 4, 6, 8 jobs per node); each job processes a dedicated, non-shared set of event files.
- Start and stop all jobs simultaneously after some predefined period of smooth running.
- Lustre with an AFS front-end for caching has the best performance (AFS/VILU), although it is similar to plain Lustre.
* HEPiX Storage Working Group – Oct 2011, Vancouver – Andrei Maslennikov
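
A minimal sketch of the "start all jobs simultaneously" element of that protocol, using a multiprocessing barrier; the job body, job count, and run length are placeholders, not the actual HEPiX harness.

    # Toy model of launching N benchmark jobs per node and releasing them together,
    # as in the HEPiX test protocol above. Job body and counts are placeholders.
    import multiprocessing
    import time

    JOBS_PER_NODE = 8      # one of the tested loads (2, 4, 6 or 8 jobs per node)
    RUN_SECONDS = 60       # stand-in for the "predefined period of smooth running"

    def skim_job(barrier, job_id):
        barrier.wait()                       # all jobs start at the same instant
        t_end = time.time() + RUN_SECONDS
        while time.time() < t_end:
            time.sleep(0.1)                  # placeholder for reading/writing events
        print("job %d done" % job_id)

    if __name__ == "__main__":
        barrier = multiprocessing.Barrier(JOBS_PER_NODE)
        procs = [multiprocessing.Process(target=skim_job, args=(barrier, i))
                 for i in range(JOBS_PER_NODE)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()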

18/20 Comparison of Fault Tolerance / Operations
Lustre (v1.8.3)
- Server: special kernel with a special FS. Client: kernel module. Both have operational implications.
- FS configuration managed through the MGS server. Needs read-ahead tuning when using root files.
- Fault tolerance: turning off 1, 2, 3 OSTs or the MDT for 10 or 120 sec on the Lustre virtual server during iozone tests.
  - 2 modes: fail-over vs. fail-out. Data lost when transitioning.
  - Graceful degradation: if an OST is down, access is suspended; if the MDT is down, ongoing access is NOT affected.
Hadoop
- Server: Java process in user space on ext3. Client: FUSE kernel module, not fully POSIX.
- Configuration managed through files. Some issues configuring the number of replicas for clients.
- Block-level replicas make the system fault tolerant.
OrangeFS (v2.8.4)
- Server: process in user space on ext3. Client: kernel module.
- Configuration managed through files.
- Limited documentation and a small community. Log files have error messages that are hard to interpret.
- Data is spread over the data nodes without replicas: no fault tolerance if one node goes down.
BlueArc
- Server: dedicated HW with RAID6 arrays. Client: mounts via NFS.
- Production-quality infrastructure at Fermilab. More HW / better scalability than the test bed used for the other solutions.

19/20 Results Summary

  Storage    Benchmark    Read (MB/s)   Write (MB/s)    Notes
  Lustre     IOZone       …             … (70 on VM)
             Root-based   12.6          -
  Hadoop     IOZone       …             …               Varies with num. of replicas
             Root-based   7.9           -
  BlueArc    IOZone       …             …               Varies with sys. conditions
             Root-based   8.4           -
  OrangeFS   IOZone       …             …               Varies with num. of name nodes
             Root-based   8.1           -

20/20 Conclusions
- Lustre on Bare Metal has the best performance as an external storage solution for the root skim app use case (fast read / little write), with consistent performance for general operations (tested via iozone).
  - Consider the operational drawback of the special kernel.
  - On-board clients are possible only via virtualization, but the server VM allows only slow writes.
- Hadoop, OrangeFS, and BlueArc have equivalent performance for the root skim use case.
  - Hadoop has good operational properties (maintenance, fault tolerance) and a fast name server, but its performance is NOT impressive.
  - BlueArc at FNAL is a good alternative for general operations since it is a well-known, production-quality solution.
- The results of the study support the growing deployment of Lustre at Fermilab, while maintaining our BlueArc infrastructure.

EXTRA SLIDES

22/20 Bandwidth Fairness – Comparison
[Plots: per-client bandwidth for external and on-board clients on each storage system]
- BA: no saturation; 3 slow clients at 3 MB/s.
- OrangeFS: Ext. 2% spread, On-board 20% spread.
- Hadoop: Ext. / On-board 2% spread.
- Lustre: Ext. / On-board 20% spread.