Optimizing Performance of HPC Storage Systems


Optimizing Performance of HPC Storage Systems
Torben Kling Petersen, PhD, Principal Architect, High Performance Computing

Current Top10 …..
All Top10 systems are running Lustre 1.8.x.

Rank | Name | Computer | Site | Total Cores | Rmax | Rpeak | Power (kW) | File system | Size | Perf
1 | Tianhe-2 | TH-IVB-FEP Cluster, Xeon E5-2692 12C 2.2GHz, TH Express-2, Intel Xeon Phi | National Super Computer Center in Guangzhou | 3,120,000 | 33,862,700 | 54,902,400 | 17,808 | Lustre/H2FS | 12.4 PB | ~750 GB/s
2 | Titan | Cray XK7, Opteron 6274 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x | DOE/SC/Oak Ridge National Laboratory | 560,640 | 17,590,000 | 27,112,550 | 8,209 | Lustre | 10.5 PB | 240 GB/s
3 | Sequoia | BlueGene/Q, Power BQC 16C 1.60GHz, Custom Interconnect | DOE/NNSA/LLNL | 1,572,864 | 17,173,224 | 20,132,659 | 7,890 | | 55 PB | 850 GB/s
4 | K computer | Fujitsu SPARC64 VIIIfx 2.0GHz, Tofu interconnect | RIKEN AICS | 705,024 | 10,510,000 | 11,280,384 | 12,659 | | 40 PB | 965 GB/s
5 | Mira | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | DOE/SC/Argonne National Lab. | 786,432 | 8,586,612 | 10,066,330 | 3,945 | GPFS | 7.6 PB | 88 GB/s
N/A | BlueWaters | Cray XK7, Opteron 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x | NCSA | - | | | | | 24 PB | 1,100 GB/s
6 | Piz Daint | Cray XC30, Xeon E5-2670 8C 2.60GHz, Aries interconnect, NVIDIA K20x | Swiss National Supercomputing Centre (CSCS) | 115,984 | 6,271,000 | 7,788,853 | 2,325 | | 2.5 PB | 138 GB/s
7 | Stampede | PowerEdge C8220, Xeon E5-2680 8C 2.7GHz, IB FDR, Intel Xeon Phi | TACC / Univ. of Texas | 462,462 | 5,168,110 | 8,520,112 | 4,510 | | 14 PB | 150 GB/s
8 | JUQUEEN | BlueGene/Q, Power BQC 16C 1.60GHz, Custom Interconnect | Forschungszentrum Juelich (FZJ) | 458,752 | 5,008,857 | 5,872,025 | 2,301 | | 5.6 PB | 33 GB/s
9 | Vulcan | BlueGene/Q, Power 16C 1.6GHz, Custom Interconnect | | 393,216 | 4,293,306 | 5,033,165 | 1,972 | | |
10 | SuperMUC | iDataPlex DX360M4, Xeon E5-2680 8C 2.70GHz, Infiniband FDR | Leibniz Rechenzentrum | 147,456 | 2,897,000 | 3,185,050 | 3,423 | | 10 PB | 200 GB/s

Performance testing: lies, bigger lies, benchmarks.

Storage benchmarks: IOR, IOzone, Bonnie++, sgpdd-survey, obdfilter-survey, FIO, dd/xdd, Filebench, dbench, Iometer, MDstat, metarates, …

Lustre® Architecture – High Level
Metadata Servers (MDS): a pair of MDS nodes serving the Metadata Target (MDT).
Object Storage Servers (OSS): 1-1,000s of servers, each serving Object Storage Targets (OSTs) on disk arrays and SAN fabric.
Lustre Clients: 1-100,000, connected over multiple network types (IB, X-GigE), with routers in the path where needed.
Gateways export the file system to NFS and CIFS clients.

Dissecting benchmarking

The chain and the weakest link …
Client side: memory, CPU, internal buses, PCI-E bus, the FS client and the MPI stack.
Network: non-blocking fabric? TCP/IP overhead? routing?
Server side (OSS): CPU, memory, OS, PCI-E bus, SAS controller/expander.
Storage: SAS port over-subscription, cabling, RAID controller (SW/HW), RAID sets, SAS or S-ATA, disk drive performance, choice of file system.
Only a balanced system will deliver performance …

Server-Side Benchmark
obdfilter-survey is a Lustre benchmark tool that measures OSS and back-end OST performance; it does not exercise LNET or the clients, which makes it a good way to isolate the servers from the network and client side.
Example of obdfilter-survey parameters:
[root@oss1 ~]# nobjlo=1 nobjhi=1 thrlo=256 thrhi=256 size=65536 obdfilter-survey
Parameters defined:
size=65536          // file size in MB (2x controller memory is good practice)
nobjhi=1 nobjlo=1   // number of files
thrhi=256 thrlo=256 // number of worker threads when testing the OSS
If results are significantly lower than expected, rerun the test multiple times to see whether the low numbers are repeatable. The benchmark can also target individual OSTs: when an OSS node performs lower than expected, the cause may be a single slow OST, e.g. due to a drive issue or a RAID array rebuild.
[root@oss1 ~]# targets="fsname-OST0000 fsname-OST0002" nobjlo=1 nobjhi=1 thrlo=256 thrhi=256 size=65536 obdfilter-survey

Client-Side Benchmark
IOR uses MPI to run the benchmark across all nodes and mimics typical HPC applications running on the clients.
Within IOR, the benchmark can be configured for file-per-process or single-shared-file access (example invocations are sketched below):
File-Per-Process: creates a unique file per task; the most common way to measure peak throughput of a Lustre parallel file system.
Single-Shared-File: creates a single file shared by all tasks running on all clients.
IOR has two primary I/O modes:
Buffered: the default; takes advantage of the Linux page cache on the client.
DirectIO: bypasses the Linux page cache and writes directly to the file system.
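As a minimal sketch (the mount point, host file, transfer and block sizes are illustrative assumptions, not values from these slides), the two file layouts and two I/O modes map onto IOR options roughly as follows:
# File-per-process (-F), buffered I/O (the default path through the client page cache)
[root@client01 ~]# mpirun -np 512 --hostfile clients ior -a POSIX -w -r -e -C -F -t 1m -b 4g -o /mnt/lustre/ior_fpp/file
# Single shared file (no -F), direct I/O (-B sets O_DIRECT and bypasses the page cache)
[root@client01 ~]# mpirun -np 512 --hostfile clients ior -a POSIX -w -r -e -C -B -t 1m -b 4g -o /mnt/lustre/ior_ssf/file
Here -w/-r run the write and read phases, -e issues an fsync when writes complete, and -C shifts task ordering for the read phase so a client does not read back the data it just wrote.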

Typical Client Configuration
At customer sites, all clients typically have the same architecture, the same number of CPU cores, and the same amount of memory. With a uniform client architecture, the IOR parameters are simpler to tune and optimize for benchmarking.
Example for 200 clients:
Number of cores per client: 16 (nproc)
Amount of memory per client: 32 GB (cat /proc/meminfo)
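A quick way to verify that the clients really are uniform (the client host names here are placeholders) is to query core count and memory across the whole set with pdsh, which is also used later in this deck:
[root@head ~]# pdsh -w client[001-200] 'nproc; grep MemTotal /proc/meminfo' | dshbak -c
dshbak -c collapses identical output, so a uniform cluster produces a single block reporting 16 cores and roughly 32 GB of memory.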

IOR Rule of Thumb
To avoid any client-side caching effect, always transfer at least 2x the aggregate memory of all clients used.
In our example: (200 clients * 32 GB) * 2 = 12,800 GB, so the total file size for the IOR benchmark will be 12.8 TB.
NOTE: Typically all nodes are uniform.
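Carrying the example through (a sketch assuming 16 IOR tasks per client, i.e. one per core), the aggregate size and the per-task block size fall out of the rule of thumb:
# aggregate file size : 200 clients x 32 GB x 2   = 12,800 GB (12.8 TB)
# MPI tasks           : 200 clients x 16 cores    = 3,200 tasks (mpirun -np 3200)
# block size per task : 12,800 GB / 3,200 tasks   = 4 GB  ->  ior -b 4g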

Lustre Configuration

Lustre Server Caching
read_cache_enable controls whether data read from disk during a read request is kept in memory and available for later read requests for the same data, without having to re-read it from disk. By default, read cache is enabled (read_cache_enable=1).
writethrough_cache_enable controls whether data sent to the OSS as a write request is kept in the read cache and available for later reads, or discarded from cache when the write completes. By default, writethrough cache is enabled (writethrough_cache_enable=1).
readcache_max_filesize controls the maximum size of a file that both the read cache and writethrough cache will try to keep in memory. Files larger than readcache_max_filesize are not kept in cache for either reads or writes. By default, files of all sizes are cached.
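These are OSS-side tunables. As a sketch (on Lustre releases of this vintage they are exposed under obdfilter; the exact parameter path can differ between versions), they can be inspected and set with lctl on each OSS:
[root@oss1 ~]# lctl get_param obdfilter.*.read_cache_enable obdfilter.*.writethrough_cache_enable obdfilter.*.readcache_max_filesize
[root@oss1 ~]# lctl set_param obdfilter.*.readcache_max_filesize=1M   # cache only files up to 1 MB (the value used in the tests below)
[root@oss1 ~]# lctl set_param obdfilter.*.read_cache_enable=1 obdfilter.*.writethrough_cache_enable=1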

Client Lustre Parameters
Network checksums: on by default and impacts performance; disabling this is the first thing we do for performance.
LRU size: controls the number of client-side locks in an LRU queue; typically we disable this parameter.
Max RPCs in flight: the maximum number of concurrent RPCs (remote procedure calls) in flight from a client; the default is 8, increase to 32.
Max dirty MB: the amount of dirty data (in MB) that can be written and queued up on the client; the default is 32, and a good rule of thumb is 4x the value of max_rpcs_in_flight.
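A sketch of how these client-side tunings are typically applied with lctl (run on every client; the settings are not persistent across remounts, and the LRU-size semantics differ somewhat between Lustre versions):
[root@client01 ~]# lctl set_param osc.*.checksums=0                  # disable network checksums
[root@client01 ~]# lctl set_param ldlm.namespaces.*osc*.lru_size=0   # hand lock LRU sizing over to the dynamic policy
[root@client01 ~]# lctl set_param osc.*.max_rpcs_in_flight=32        # default is 8
[root@client01 ~]# lctl set_param osc.*.max_dirty_mb=128             # 4x max_rpcs_in_flight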

Lustre Striping
The default Lustre stripe size is 1 MB and the default stripe count is 1: each file is written to one OST with a 1 MB stripe size. When multiple files are created and written, the MDS makes a best effort to distribute the load across all available OSTs.
The default stripe size and count can be changed. The smallest stripe size is 64 KB (increasable in 64 KB increments), and the stripe count can be increased to include all OSTs.
Setting the stripe count to all OSTs means each file is created across every OST; this is best when creating a single shared file from multiple Lustre clients.
One can create multiple directories with different stripe sizes and counts to optimize for performance.
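A minimal sketch of per-directory striping with lfs (the directory names are illustrative; older Lustre clients use -s rather than -S for the stripe size):
[root@client01 ~]# lfs setstripe -c 1  -S 1M /mnt/lustre/fpp_dir    # one OST per file: suits file-per-process runs
[root@client01 ~]# lfs setstripe -c -1 -S 1M /mnt/lustre/ssf_dir    # stripe over all OSTs: suits a single shared file
[root@client01 ~]# lfs getstripe /mnt/lustre/ssf_dir                # verify the layout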

Experimental setup & Results

Equipment used
ClusterStor: 2x CS6000 SSUs, 2 TB NL-SAS Hitachi drives, 4U CMU, Neo 1.2.1 with HF applied
Clients: 64 clients, 12 cores and 24 GB memory each, QDR InfiniBand, Mellanox FDR core switch
Lustre: client version 1.8.7, server version 2.1.3, 4 OSSes, 16 OSTs (RAID 6)

Subset of test parameters
Disk backend testing: obdfilter-survey
Client-based testing: IOR, varying I/O mode, I/O slots per client, IOR transfer size, and number of client threads
Lustre server tunings: writethrough cache enabled, read cache enabled, readcache_max_filesize = 1M
Client settings: LRU disabled, checksums disabled, max RPCs in flight = 32

Lustre obdfilter-survey
# pdsh -g oss "TERM=linux thrlo=256 thrhi=256 nobjlo=1 nobjhi=1 rsz=1024K size=32768 obdfilter-survey"
cstor01n04: ost 4 sz 134217728K rsz 1024K obj 4 thr 1024 write 3032.89 [ 713.86, 926.89] rewrite 3064.15 [ 722.83, 848.93] read 3944.49 [ 912.83, 1112.82]
cstor01n05: ost 4 sz 134217728K rsz 1024K obj 4 thr 1024 write 3022.43 [ 697.83, 819.86] rewrite 3019.55 [ 705.15, 827.87] read 3959.50 [ 945.20, 1125.76]
Summing the two OSS nodes, a single SSU has a write performance of 6,055 MB/s (75.9 MB/s per disk) and a read performance of 7,904 MB/s (98.8 MB/s per disk).

Buffered I/O

Direct I/O: np = 512 (8 threads per client)

Summary

Reflections on the results
Never trust marketing numbers …
Testing every stage of the data pipeline is essential.
Optimal parameters and/or methodology for read and write are seldom the same; real-life applications can often be configured accordingly.
Balanced architectures will deliver performance: client-based IOR performs within 5% of the backend, in excess of 750 MB/s per OST … -> 36 GB/s per rack …
A well-designed solution will scale linearly using Lustre (cf. NCSA BlueWaters).

Optimizing Performance of HPC Storage Systems
Torben_Kling_Petersen@xyratex.com
Thank You