Client Configuration and Lustre Benchmarking: Client Setup and Package Installation, Client Lustre Configuration, Client Tuning Parameters, Lustre Striping, Benchmarking.

Presentation transcript:

Client Configuration Lustre Benchmarking

Client Setup and Package Installation Client Lustre Configuration Client Tuning Parameters Lustre Striping Benchmarking Base-lining for MDRAID obdfilter-survey IOR Agenda

Client Configuration for SSH Setting up ssh keys

All clients must have a functioning SSH server that allows both direct root access and key-based authentication. You then need to generate a master key on your head node and copy this key into the ~/.ssh/authorized_keys file on each client. Required SSH keys on Clients

Login to the Client head node Generate the master key on Client Head node # ssh-keygen -t rsa Generating public/private rsa key pair. Enter file in which to save the key (/root/.ssh/id_rsa): Enter passphrase (empty for no passphrase): Enter same passphrase again: Your identification has been saved in /root/.ssh/id_rsa. Your public key has been saved in /root/.ssh/id_rsa.pub. The key fingerprint is: 8c:d1:39:a7:68:a3:e1:5f:d9:95:b3:e6:13:6a:8e:cc client1.xyus.xyratex.com Read the generated public key from the head node # cat ~/.ssh/id_rsa.pub ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAwZsS68UMWSXaybwAxnaHq30VIL0uM54VVgiJmTLZQ/qFhH0/GP6WTSUPk5U/eiRRc1Lhfp7AY3VWdKQ2wv084EMC+9uPu Fht9ugOPaPI4yVFYskZ+NNYKb6v07hGW10wD25jMPZ/omxsVx1cHt25KlDc+FA2Wj1mxK6x61vQayPxQh4WFHhCgM30TsllrAB9SHh37+ookHTeY8xpQpbunR GCyBrRFqVLcusnho4P5zZrtSrKlPLjKIy1kg43hVgzSk6ae5FVSvaYQmubQb1Q31ftrwne7zqCLjfhudkgsETBDJtteWZPFUpRZYpbtvOkfCqa/XiSrOY8Xc/ nxq0Dvw== How to setup SSH keys on the Client

Logon to each client, make the .ssh folder, and copy the public key generated on the head node to each client. For example, looking at a sample configuration for one client, client2 ~]# ssh client2 password: Last login: Fri May 17 01:52: from ~]# mkdir -m 0700 ~/.ssh ~]# cat >> ~/.ssh/authorized_keys ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAwZsS68UMWSXaybwAxnaHq30VIL0uM54VVgiJmTLZQ/qFhH0/GP6WTSUPk 5U/eiRRc1Lhfp7AY3VWdKQ2wv084EMC+9uPuFht9ugOPaPI4yVFYskZ+NNYKb6v07hGW10wD25jMPZ/omxsVx 1cHt25KlDc+FA2Wj1mxK6x61vQayPxQh4WFHhCgM30TsllrAB9SHh37+ookHTeY8xpQpbunRGCyBrRFqVLcus nho4P5zZrtSrKlPLjKIy1kg43hVgzSk6ae5FVSvaYQmubQb1Q31ftrwne7zqCLjfhudkgsETBDJtteWZPFUpR ZYpbtvOkfCqa/XiSrOY8Xc/nxq0Dvw== ~]# exit Use the key to configure all clients

Client Configuration for pdsh Setting up pdsh

Install pdsh on the head node # yum install -y pdsh /etc/hosts is already configured with hostnames for you How to use pdsh for each group to confirm SSH is configured correctly Group 1: # pdsh -w super[00-03] date Installing and using pdsh
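pdsh's bracket notation expands a host range; for ad-hoc scripting outside pdsh, plain bash brace expansion produces the same host list. The client1..client12 names are an assumption carried over from later slides:

```shell
# pdsh's -w client[1-12] range notation names twelve hosts; bash brace
# expansion yields the same list for ad-hoc loops or building host files.
printf '%s\n' client{1..12}
```

The same expansion works inline, e.g. `for h in client{1..12}; do ssh "$h" date; done`.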

Client Installation of MPI

OpenMPI is needed to execute IOR on all clients using the command 'mpirun' From the head node, run the following commands: pdsh -w client[1-12] 'yum install -y openmpi openmpi-devel' Add openmpi to your path pdsh -w client[1-12] 'ldconfig /usr/lib64/openmpi/lib' pdsh -w client[1-12] 'export PATH=$PATH:/usr/lib64/openmpi/bin:/usr/lib64/openmpi/lib' Check the path before continuing pdsh -w client[1-12] 'echo $PATH' If the path is not correct, you might need to check the shell, edit the .bashrc file for example, source the file, then copy it to all clients export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib Install openmpi on all nodes
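One way to make the path fix persistent, as the slide suggests: append the exports to .bashrc and push the file out. This sketch writes to a local demo file (bashrc.demo) rather than the real ~/.bashrc; the distribution step is left as a comment:

```shell
#!/usr/bin/env bash
# Sketch: persist the OpenMPI paths in a shell rc file so they survive new
# logins. A demo file is used here; in practice point RC at ~/.bashrc and
# distribute it, e.g.: pdcp -w client[1-12] ~/.bashrc /root/.bashrc
RC=./bashrc.demo
cat >> "$RC" <<'EOF'
export PATH=$PATH:/usr/lib64/openmpi/bin
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib
EOF
grep -c openmpi "$RC"   # both appended lines reference the OpenMPI tree
```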

Client InfiniBand Installation Stock OFED

InfiniBand is essential to run IOR using RDMA over IB. From the head node, run the following commands to install IB. ~]# pdsh -w client[1-12] 'yum groupinstall -y "Infiniband Support"' ~]# pdsh -w client[1-12] 'yum install -y infiniband-diags' Start RDMA and bring up the ib0 interface ~]# pdsh -w client[1-12] 'service rdma start' ~]# pdsh -w client[1-12] 'ifup ib0' Install the OFED/InfiniBand Packages

Client Download and Installation of IOR

From the head node, download and build IOR. To build IOR, we need to install the autoconf and make packages, plus the Development Tools group, which installs the make utility ~]# yum install -y make autoconf ~]# yum groupinstall -y "Development Tools" Download the IOR tool ~]# wget sio/IOR%20latest/IOR /IOR tgz Download and build IOR

IOR is downloaded to the local directory. Untar/ungzip it and run the make utility to build IOR ~]# tar -zxvf IOR tgz Go to the IOR directory that was untarred/unzipped IOR]# make (cd ./src/C && make posix) make[1]: Entering directory `/root/IOR/src/C' mpicc -g -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -c IOR.c mpicc -g -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -c utilities.c mpicc -g -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -c parse_options.c mpicc -g -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -c aiori-POSIX.c mpicc -g -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -c aiori-noMPIIO.c mpicc -g -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -c aiori-noHDF5.c mpicc -g -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -c aiori-noNCMPI.c mpicc -o IOR IOR.o utilities.o parse_options.o aiori-POSIX.o aiori-noMPIIO.o aiori-noHDF5.o aiori-noNCMPI.o \ -lm make[1]: Leaving directory `/root/IOR/src/C' ~]# cp /root/IOR/src/C/IOR . Download and build IOR

Copy IOR to all clients ~]# pdcp -w client[2-12] IOR /root/. Confirm IOR is linked against the correct library ~]# pdsh -w client[1-12] "ldd IOR | grep mpi" client5: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x b ) client7: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x e6c00000) client8: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x a400000) client6: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x e ) client1: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x ebee00000) client3: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x dbe00000) client2: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x fe ) client11: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x00007fa ) client12: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x00007f8c5bf6e000) client4: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x a ) client9: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x00007f8f658e3000) client10: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x00007f475a019000) Download and build IOR

Client Lustre Client Installation Installing Lustre Client RPMs

Two packages are required to be installed or built on the clients lustre-client-modules-.rpm -- Lustre patchless client modules lustre-client-.rpm -- Lustre utilities The correct RPMs to install at the customer site need to be confirmed through the site survey, working with the ClusterStor Support organization If you don't have a pre-built Lustre client for your particular client, then you will have to build the client using the SRC RPM package Download clients from: Client Lustre Packages

© Xyratex 2013 Just to make sure: Unmount Lustre (if already mounted) # umount /mnt/lustre Unload Lustre modules # lustre_rmmod Install kernel development packages and compilers ]# yum install -y kernel-devel libselinux-devel rpm-build ]# yum groupinstall -y "Development Tools" Download the Lustre SRC RPM and install it # rpm -ivh --nodeps lustre-client _ el6.x86_64.src.rpm 1:lustre-client warning: user jenkins does not exist - using root warning: group jenkins does not exist - using root ########################################### [100%] warning: user jenkins does not exist - using root warning: group jenkins does not exist - using root The warnings above are harmless; they indicate the rpmbuild directory for this client is under /root Building Lustre Clients

© Xyratex 2013 Go to the following directory and do the following SOURCES]# cd /root/rpmbuild/SOURCES/ SOURCES]# ls lustre tar.gz SOURCES]# gunzip lustre tar.gz SOURCES]# tar xvf lustre tar cd into the following directory: SOURCES]# cd lustre lustre-2.4.3]# pwd /root/rpmbuild/SOURCES/lustre Go to the SRC RPM installation directory and run the following commands to build the RPMs for your specific kernel, assuming stock OS OFED # make distclean # ./configure --disable-server --with-linux=/usr/src/kernels/ el6.x86_64 # make && make rpms Building Lustre Clients

© Xyratex 2013 The RPMs just built can be found in the following directory: # /root/rpmbuild/RPMS/x86_64 Two packages are required to be installed or built on the clients lustre-client-modules-.rpm -- Lustre patchless client modules lustre-client-.rpm -- Lustre utilities Copy the built RPMs to the other clients and install them, if all clients have the same OS and kernel # rpm -ivh lustre-client-modules _358.el6.x86_64.x86_64.rpm # rpm -ivh lustre-client _358.el6.x86_64.x86_64.rpm NOTE: can be or Install the Lustre Client RPMs

© Xyratex 2013 Edit or create /etc/modprobe.d/lnet.conf options lnet networks="o2ib0(ib0)" -- for IB nodes options lnet networks="tcp(eth20)" -- for Ethernet nodes Install the two Lustre client RPMs just built Start Lustre # modprobe lustre Possibly start RDMA on IB systems and bring up ib0 # service rdma start # ifup ib0 Configuration and Starting Lustre on Clients

# ssh ~]# cscli fs_info Information about "tsefs2" file system: Node Node type Targets Failover partner Devices tsesys2n02 mgs 0 / 0 tsesys2n03 tsesys2n03 mds 1 / 1 tsesys2n02 /dev/md66 tsesys2n04 oss 1 / 1 tsesys2n05 /dev/md0 tsesys2n05 oss 1 / 1 tsesys2n04 /dev/md1 tsesys2n06 oss 1 / 1 tsesys2n07 /dev/md0 tsesys2n07 oss 1 / 1 tsesys2n06 /dev/md1 ~]# ssh tsesys2n02 'lctl list_nids' Mount Command from Clients: mount -t lustre CS6000 GridRAID System

# ssh ~]# cscli fs_info Information about "fs1" file system: Node Node type Targets Failover partner Devices hvt1sys02 mgs 0 / 0 hvt1sys03 hvt1sys03 mds 1 / 1 hvt1sys02 /dev/md66 hvt1sys04 oss 4 / 4 hvt1sys05 /dev/md0, /dev/md2, /dev/md4, /dev/md6 hvt1sys05 oss 4 / 4 hvt1sys04 /dev/md1, /dev/md3, /dev/md5, /dev/md7 ~]# ssh hvt1sys02 'lctl list_nids' Mount Command from Clients: mount -t lustre /mnt/fs1 CS6000 MDRAID System

To find out the InfiniBand IP address and LNET name to mount Lustre on the client, logon to the MGS node that has the MGT mounted and run the following command ~]# lctl list_nids -- IP address of ib0 on the MGS node (MGS and MDS can run on the same node or different nodes) o2ib0 -- default LNET network for RDMAoIB It is good practice to use a mount option listing both the primary and secondary MGS, to allow the clients to still access the filesystem in the event of the MGS target failing over from the primary to the secondary node mount -t lustre /mnt/lustre Mount Lustre on the Clients
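A sketch of composing the failover-aware mount command. The NIDs and fs name below are placeholders, not values from this system; substitute the `lctl list_nids` output from your MGS nodes:

```shell
#!/usr/bin/env bash
# Sketch: build the client mount source with primary and secondary MGS NIDs
# separated by ':' so the client can fail over between them.
mgs_primary="192.168.1.3@o2ib0"     # placeholder: lctl list_nids on primary MGS
mgs_secondary="192.168.1.2@o2ib0"   # placeholder: lctl list_nids on secondary MGS
fsname="lustre"
echo "mount -t lustre ${mgs_primary}:${mgs_secondary}:/${fsname} /mnt/${fsname}"
```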

We first need to create the mount point, then issue the mount command pdsh -w client[1-200] 'mkdir /mnt/lustre' pdsh -w client[1-200] 'mount -t lustre /mnt/lustre' Check if all clients mounted successfully pdsh -w client[1-200] 'mount -t lustre' | wc -l 200 Check the state of the filesystem from one client with the following command. We have 36 OSS servers, so the output should be 144 OSTs ~]# lfs check servers | grep OST | wc -l 146 Example of Mounting Lustre on All Clients
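The expected OST count can be sanity-checked arithmetically before comparing against `lfs check servers`. The 4-OSTs-per-OSS figure matches the MDRAID fs_info slide and is an assumption for other configurations:

```shell
# Expected OST count = OSS servers x OSTs per OSS (4 per OSS on MDRAID).
oss_servers=36
osts_per_oss=4
echo "expected OSTs: $((oss_servers * osts_per_oss))"
# compare against the live count: lfs check servers | grep OST | wc -l
```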

Client Lustre Tuning

Network Checksums Default is on, and it impacts performance. Disabling this is the first thing we do for performance LRU Size Typically we disable this parameter Parameter used to control the number of client-side locks in an LRU queue Max RPCs in Flight Default is 8 RPC is remote procedure call This tunable is the maximum number of concurrent RPCs in flight from clients. Max Dirty MB Default is 32; a good rule of thumb is 4x the value of max_rpcs_in_flight. Defines the amount of dirty data, in MB, that can be written and queued up on the client Client Lustre Parameters

The first thing to always do is disable wire checksums on the client and disable the LRU max_rpcs_in_flight and max_dirty_mb are a product of the number of clients available for benchmarks. Typically, we increase max_rpcs_in_flight to 32 for 1.8.9 clients, and to 256 for 2.4.x/2.5.x clients In some cases, if we still don't get performance, we then increase max_dirty_mb to 4x max_rpcs_in_flight for 1.8.9 clients, or the same as max_rpcs_in_flight for 2.4.x/2.5.x clients Procedures for Benchmarking 1. Disable Checksums 2. Disable LRU 3. Increase max_rpcs_in_flight for the specific client 4. Increase max_dirty_mb for the specific client Procedure to optimize Client Side Tuning

Disable client checksums with the specific FS name of cstorfs ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/checksums; do echo 0 > $n; done' Increase max RPCs in flight from the default 8 to 32 ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_rpcs_in_flight; do echo 32 > $n; done' Disable LRU ~]# pdsh -w client[1-12] 'lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))' Increase max dirty MB from 32 to 128 ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_dirty_mb; do echo 128 > $n; done' NOTE: These settings are not persistent and will need to be reset if you re-mount Lustre or reboot the client Client Lustre Tuning Parameters

Disable client checksums with the specific FS name of cstorfs ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/checksums; do echo 0 > $n; done' Increase max RPCs in flight from the default 8 to 256 ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_rpcs_in_flight; do echo 256 > $n; done' Disable LRU ~]# pdsh -w client[1-12] 'lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))' Increase max dirty MB from 32 to 256 ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_dirty_mb; do echo 256 > $n; done' NOTE: These settings are not persistent and will need to be reset if you re-mount Lustre or reboot the client. 2.4.x/2.5.x Client Lustre Tuning Parameters
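The 1.8.x and 2.4.x/2.5.x tuning slides differ only in values; a minimal sketch that emits the appropriate commands for either client generation. The fs name cstorfs is carried over from the examples, and the `lctl set_param osc.*` form is used here as an assumed equivalent to echoing into /proc (the commands are printed, not executed):

```shell
#!/usr/bin/env bash
# Sketch: print the client tuning commands for a given Lustre client version,
# using the values from the slides (32/128 for 1.8.x, 256/256 for 2.4.x/2.5.x).
tuning_cmds() {
  local ver=$1 rpcs dirty
  case "$ver" in
    1.8*) rpcs=32  dirty=128 ;;   # 1.8.9 clients
    *)    rpcs=256 dirty=256 ;;   # 2.4.x/2.5.x clients
  esac
  echo "lctl set_param osc.cstorfs-OST*.checksums=0"
  echo 'lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))'
  echo "lctl set_param osc.cstorfs-OST*.max_rpcs_in_flight=${rpcs}"
  echo "lctl set_param osc.cstorfs-OST*.max_dirty_mb=${dirty}"
}
tuning_cmds 2.4   # print the four commands for a 2.4.x client
```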

Based on the LUG 2014 client performance comparison, surprisingly, keeping checksums enabled has only up to about a 5% impact on performance The default algorithm is Adler32, but CRC32 is also available; we suggest using CRC32 due to hardware acceleration support on current CPU technologies Enable client checksums with the specific FS name of cstorfs ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/checksums; do echo 1 > $n; done' Select the CRC32 client checksum type with the specific FS name of cstorfs ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/checksum_type; do echo crc32 > $n; done' A Note on Checksums

Jumbo frames give a >= 30% improvement in Lustre performance compared to the standard MTU of 1500 Change the MTU on clients and servers to 9000 Change the MTU on the switches to 9214 (or the max MTU size) to accommodate payload overhead Never set the switch MTU to the same value as the clients and servers Ethernet Tuning
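A sketch of persisting the client-side MTU in a RHEL-style ifcfg file. It is written to a demo path here rather than /etc/sysconfig/network-scripts/, and the device name eth20 is borrowed from the lnet.conf example:

```shell
#!/usr/bin/env bash
# Sketch: enable jumbo frames (MTU 9000) in an ifcfg fragment; the interface
# picks this up on the next ifup/network restart.
cfg=./ifcfg-eth20.demo    # real path: /etc/sysconfig/network-scripts/ifcfg-eth20
cat > "$cfg" <<'EOF'
DEVICE=eth20
ONBOOT=yes
MTU=9000
EOF
grep ^MTU "$cfg"
```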

Server Side Benchmark for MDRAID Using obdfilter-survey

obdfilter-survey is a Lustre benchmark tool that measures OSS and backend OST performance; it does not measure LNET or client performance This is a good benchmark to isolate the network and clients from the server It is run from the primary management node You must run as root to execute obdfilter-survey on the OSS nodes. Server Side Benchmark

Before running obdfilter-survey, we want to make sure all targets are mounted on their primary servers. CS6000 Health ~]# pdsh -g lustre 'grep -c lustre /proc/mounts' | dshbak -c lmtest[ ] lmtest[ ] If the output is different from above, use HA to failover/failback resources to their primary servers before proceeding Check CS6000 Configuration for Health

nobjlo=1 nobjhi=1 thrlo=256 thrhi=256 size=65536 obdfilter-survey Parameter definitions size=65536 // file size, which is 2x controller memory nobjhi=1 nobjlo=1 // number of files thrhi=256 thrlo=256 // number of worker threads when testing the OSS and SSU The results for each OSS should be in the range of 3000MB/s on writes and 3500MB/s on reads If you see results significantly lower, rerun the test multiple times to ensure those anomalies are not consistent. NOTE: obdfilter-survey is intrusive, must be run as root, and can occasionally induce an LBUG on the OSS node; don't be alarmed. obdfilter-survey Setup for CS6000 MDRAID

~]# pdsh -g oss ’nobjlo=1 nobjhi=1 thrlo=256 thrhi=256 size=65536 obdfilter-survey' lmtest408: Sun May 19 17:01:47 PDT 2013 Obdfilter-survey for case=disk from lmtest408 lmtest409: Sun May 19 17:01:47 PDT 2013 Obdfilter-survey for case=disk from lmtest409 lmtest404: Sun May 19 17:01:47 PDT 2013 Obdfilter-survey for case=disk from lmtest404 lmtest406: Sun May 19 17:01:47 PDT 2013 Obdfilter-survey for case=disk from lmtest406 lmtest407: Sun May 19 17:01:48 PDT 2013 Obdfilter-survey for case=disk from lmtest407 lmtest405: Sun May 19 17:01:47 PDT 2013 Obdfilter-survey for case=disk from lmtest405 lmtest406: ost 4 sz K rsz 1024K obj 4 thr 1024 write [ , ] rewrite [ , ] read [ , ] lmtest409: ost 4 sz K rsz 1024K obj 4 thr 1024 write [ , ] rewrite [ , ] read [ , ] lmtest408: ost 4 sz K rsz 1024K obj 4 thr 1024 write [ , ] rewrite [ , ] read [ , ] lmtest407: ost 4 sz K rsz 1024K obj 4 thr 1024 write [ , ] rewrite [ , ] read [ , ] lmtest406: done! lmtest408: done! lmtest409: done! lmtest407: done! lmtest405: ost 4 sz K rsz 1024K obj 4 thr 1024 write [ , ] rewrite [ , ] read [ , ] lmtest404: ost 4 sz K rsz 1024K obj 4 thr 1024 write [ , ] rewrite [ , ] read [ , ] lmtest404: done! lmtest405: done! obdfilter-survey Results for CS6000 MDRAID with 256 worker threads

Server Side Benchmark for GridRAID Using obdfilter-survey

obdfilter-survey is a Lustre benchmark tool that measures OSS and backend OST performance; it does not measure LNET or client performance This is a good benchmark to isolate the network and clients from the server It is run from the primary management node You must run as root to execute obdfilter-survey on the OSS nodes. Server Side Benchmark

Before running obdfilter-survey, we want to make sure all targets are mounted on their primary servers. CS9000 Health ~]# pdsh -g lustre 'grep -c lustre /proc/mounts' | dshbak -c lmtest[ ] lmtest[ ] If the output is different from above, use HA to failover/failback resources to their primary servers before proceeding Check ClusterStor Configuration for Health

The pre-allocation size of LDISKFS is not set correctly for GridRAID and will need to be changed for optimal performance. NOTE: This is fixed in ClusterStor 1.5 First, check the pre-allocation size on each OSS: ~]# pdsh -g oss 'cat /proc/fs/ldiskfs/md*/prealloc_table' tsesys2n04: tsesys2n06: tsesys2n05: tsesys2n07: To correct the pre-allocation size: ~]# pdsh -g oss 'echo " " > /proc/fs/ldiskfs/md*/prealloc_table' ClusterStor 1.4 Tuning Change Required

nobjlo=1 nobjhi=1 thrlo=512 thrhi=512 size= obdfilter-survey Parameter definitions size= // file size, which is 2x controller memory nobjhi=1 nobjlo=1 // number of files thrhi=512 thrlo=512 // number of worker threads when testing the OSS and SSU The results for each CS9000 OSS should be in the range of 4300MB/s on writes and reads The results for each CS6000 OSS should be in the range of 3100MB/s on writes and 3700MB/s on reads If you see results significantly lower, rerun the test multiple times to ensure those anomalies are not consistent. NOTE: obdfilter-survey is intrusive, must be run as root, and can occasionally induce an LBUG on the OSS node; don't be alarmed. Use this for CS6000 and CS9000 SSUs obdfilter-survey Setup for ClusterStor GridRAID SSU

~]# pdsh -g oss 'nobjlo=1 nobjhi=1 thrlo=512 thrhi=512 size= obdfilter-survey' tsesys2n05: Mon May 5 10:43:04 PDT 2014 Obdfilter-survey for case=disk from tsesys2n05 tsesys2n07: Mon May 5 10:43:04 PDT 2014 Obdfilter-survey for case=disk from tsesys2n07 tsesys2n06: Mon May 5 10:43:04 PDT 2014 Obdfilter-survey for case=disk from tsesys2n06 tsesys2n04: Mon May 5 10:43:04 PDT 2014 Obdfilter-survey for case=disk from tsesys2n04 tsesys2n06: ost 1 sz K rsz 1024K obj 1 thr 512 write [ , ] rewrite [ , ] read [ , ] tsesys2n05: ost 1 sz K rsz 1024K obj 1 thr 512 write [ , ] rewrite [ , ] read [ , ] tsesys2n07: ost 1 sz K rsz 1024K obj 1 thr 512 write [ , ] rewrite [ , ] read [ , ] tsesys2n04: ost 1 sz K rsz 1024K obj 1 thr 512 write [ , ] rewrite [ , ] read [ , ] tsesys2n06: done! tsesys2n05: done! tsesys2n07: done! tsesys2n04: done! obdfilter-survey Results for GridRAID with 512 worker threads

Client Side Benchmark for MDRAID Typical IOR Client Configuration

At customer sites, typically all clients have the same architecture, same number of CPU cores, and same amount of memory. NOTE: Our configuration for this training is a bit unique and required additional thought to get performance per SSU With a uniform client architecture, the parameters for IOR are simpler to tune and optimize for benchmarking A minimum of 8 Clients per SSU MDRAID Typical Client Configuration

You always want to transfer 2x the total memory of all clients used, to avoid any client-side caching effects In our example: (8_clients*32GB_memory)*2 = 512GB The total file size for the IOR benchmark will be 512GB NOTE: Typically all nodes are uniform. IOR Rule of Thumb
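The rule of thumb as arithmetic, using the values from this example:

```shell
# IOR sizing rule: total transfer = 2 x (number of clients x memory per client),
# so the aggregate file size exceeds client cache.
clients=8
mem_gb=32       # GB of RAM per client
total_gb=$((2 * clients * mem_gb))
echo "total IOR file size: ${total_gb}GB"
```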

mpirun is used to execute IOR across all clients from the head node Within mpirun we use the following options: -machinefile machinefile.txt, -np, and --byslot -machinefile is a simple text file listing all clients that will execute the IOR benchmark -np defines the total number of tasks to execute across all clients (e.g. -np 64 states 8 tasks per client with 8 clients) --byslot defines how many tasks are executed on the first node before starting additional tasks on the second node, and so on. This is tied to how the machinefile options are defined --bynode is another option, which executes 1 task per node before executing additional tasks per node. Using MPI to execute IOR

Create a simple file called 'machinefile.txt' listing all the clients with the following options slots=4 max_slots='Max Number of CPU Cores' -- in our example, 16 cores. Because we edited the /etc/hosts file with the client IP addresses, we only need to use the associated hostname for each client listed in /etc/hosts. This is also true if DNS is used at the customer site; there is then no need to define node names in /etc/hosts, or one can use the IPv4 address in the machinefile. Creating the machinefile on the head node

~]# vi machinefile.txt client1 slots=4 max_slots=16 client2 slots=4 max_slots=16 client3 slots=4 max_slots=16 client4 slots=4 max_slots=16 ………… client13 slots=4 max_slots=16 client14 slots=4 max_slots=16 client15 slots=4 max_slots=16 client16 slots=4 max_slots=16 :wq! With the Xyratex-defined machinefile, and using the --byslot option, we will use 4 slots on the first node, then 4 slots on the second node, and so on If using --bynode, we will round-robin the number of slots per node regardless of the machinefile configuration Slots = tasks per node Sample machinefile for ClusterStor
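Rather than typing 16 lines by hand, the machinefile can be generated. Hostnames, slots, and max_slots follow the sample above; the file is written to the current directory:

```shell
#!/usr/bin/env bash
# Sketch: generate an Open MPI machinefile for 16 uniform clients,
# 4 slots each with a 16-core ceiling, matching the sample slide.
out=./machinefile.txt
: > "$out"                      # start from an empty file
for i in $(seq 1 16); do
  echo "client${i} slots=4 max_slots=16" >> "$out"
done
wc -l < "$out"                  # 16 lines, one per client
```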

Typical IOR parameters for 16 nodes with 32GB of memory: /usr/lib64/openmpi/bin/mpirun -machinefile machinefile.txt -np 128 --byslot ./IOR -v -F -t 1m -b 8g -o /mnt/lustre/test.`date +"%Y%m%d.%H%M%S"` -np 128 = all 16 nodes used with 8 slots (tasks) per node -b 8g = (2x32GB*16_Clients)/128_tasks Typically all nodes will be uniform, so we have to use the lowest common denominator -o /mnt/lustre/test.`date +"%Y%m%d.%H%M%S"` We found that using a different output test file provides better performance than reusing the same filename for each run -F : file per process -t 1m : transfer size of 1m -v : verbose output Defining IOR Parameters
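Deriving -np and -b from the client pool, a sketch of the arithmetic this slide applies:

```shell
# -np = clients x tasks per client; -b = (2 x total client memory) / np,
# so every run still transfers 2x aggregate client RAM.
clients=16
tasks_per_client=8          # tasks actually placed per node with --byslot
mem_gb=32                   # GB of RAM per client
np=$((clients * tasks_per_client))        # -> -np 128
b_gb=$(( (2 * mem_gb * clients) / np ))   # -> -b 8g
echo "-np ${np} -b ${b_gb}g"
```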


Writes: Buffered IO -F : file per process -t 1m Default is buffered IO --byslot option -np and -b will be a product of each other to achieve 6GB/s per SSU or better Reads: Direct IO -F : file per process -t 64m -B : Direct IO instead of the default buffered IO -np and -b will be a product of each other to achieve 6GB/s per SSU or better --byslot option Baseline IOR Performance for CS6000 MDRAID

Use the -w flag in IOR for write-only results with buffered IO -F : file per process -t 1m Write-only operation: -w Default is buffered IO --byslot option -np and -b will be a product of each other to achieve 6GB/s per SSU or better First, ensure a Lustre stripe size of 1m and stripe count of 1 lfs setstripe -s 1m -c 1 /mnt/lustre/fpp Execute IOR for writes mpirun -np 64 --byslot /bin/IOR -F -t 1mb -b 16g -w -o /mnt/lustre/fpp/testw.out IOR Write for MDRAID

We first need to write the data, then read it back using Direct IO -F : file per process Use the write and read flags: -w -r -t 64m -B : Direct IO instead of the default buffered IO -np and -b will be a product of each other to achieve 6GB/s per SSU or better First, ensure a Lustre stripe size of 1m and stripe count of 1 lfs setstripe -s 1m -c 1 /mnt/lustre/fpp Execute IOR for reads mpirun -np 64 --byslot /bin/IOR -F -B -t 64mb -b 16g -w -r -o /mnt/lustre/fpp/testr.out IOR Reads for MDRAID

Stonewall can be used to perform a write or read test under a fixed time in seconds to ensure only maximum performance is measured Removes unbalanced task completion that can affect performance results Good to use when clients are a non-uniform architecture If used, ensure you specify a much bigger block size in IOR (-b) and a long enough time to write or read 2x client memory Typically, I increase -b in IOR by a factor of 4x or more Advanced IOR Option: Stonewall
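The stonewall sizing rule as arithmetic; the 16g per-task baseline and the 240-second limit come from the surrounding examples:

```shell
# With stonewall (-D seconds) the timer ends the run, so -b is deliberately
# oversized; the slides suggest growing -b by 4x or more over the
# non-stonewall per-task block size.
b_gb=16      # per-task block size without stonewall (GiB, from this example)
factor=4
echo "-b $((b_gb * factor))g -D 240"
```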

Use the -w flag in IOR for write-only results with buffered IO -F : file per process -t 1m Write-only operation: -w Default is buffered IO -D 240 (4 min) -np and -b will be a product of each other to achieve 6GB/s per SSU or better First, ensure a Lustre stripe size of 1m and stripe count of 1 lfs setstripe -s 1m -c 1 /mnt/lustre/fpp Execute IOR for writes mpirun -np 64 --byslot /bin/IOR -F -t 1mb -b 512g -w -o /mnt/lustre/fpp/testw.out -D 240 IOR Write for MDRAID w/ Stonewall

We first need to write the data, then read it back using Direct IO
-F: file per process
Use the write and read flags: -w -r
-t 64m
With Stonewall, we need to write without Stonewall first, then read back in a separate IOR command using Stonewall
-k: keep the write output test files to read back using Stonewall
-B: Direct IO instead of the default Buffered IO
The product of -np and -b sets the total transfer size; size it to achieve 6GB/s per SSU or better
IOR Reads for MDRAID w/ Stonewall

Step 1: Set the Lustre stripe size and stripe count
lfs setstripe -s 1m -c 1 /mnt/lustre/fpp
Step 2: IOR write with Direct IO, with a block size large enough to read back at least 2x client memory, keeping the output file (we want the write to complete)
mpirun -np 64 --byslot /bin/IOR -F -B -t 64mb -b 64g -w -k -o /mnt/lustre/fpp/testr.out
Step 3: IOR read back the output test file from Step 2 with the Stonewall option
mpirun -np 64 --byslot /bin/IOR -F -B -t 64mb -b 64g -r -o /mnt/lustre/fpp/testr.out -D 240
IOR Reads for MDRAID w/ Stonewall

Client Side Benchmark for GridRAID Typical IOR Client Configuration

At customer sites, typically all clients have the same architecture, the same number of CPU cores, and the same amount of memory.
NOTE: Our configuration for this training is a bit unique and required additional thought to reach full performance per SSU
With a uniform client architecture, the IOR parameters are simpler to tune and optimize for benchmarking
A minimum of 16 clients per SSU
GridRAID Typical Client Configuration

Always transfer 2x the total memory of all clients used to avoid any client-side caching effect
In our example: (16*32GB)*2 = 1024GB
The total file size for the IOR benchmark will be 1024GB
NOTE: Typically all nodes are uniform.
IOR Rule of Thumb
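The rule of thumb above can be written as a pair of one-line shell helpers: the total transfer is twice the aggregate client RAM, and the per-task -b value is that total divided by the -np task count. A minimal sketch using the example numbers from the text:

```shell
#!/bin/sh
# 2x-memory rule: total GB written must be at least twice the aggregate
# client RAM so client-side caching cannot inflate the results.
total_transfer_gb() { echo $(( 2 * $1 * $2 )); }        # clients, GB each
# Per-task IOR -b value: total transfer divided by the -np task count.
per_task_block_gb() { echo $(( 2 * $1 * $2 / $3 )); }   # clients, GB, np

total_transfer_gb 16 32        # (16*32GB)*2 = 1024 GB
per_task_block_gb 16 32 128    # 1024GB / 128 tasks -> -b 8g
```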

mpirun is used to execute IOR across all clients from the head node
Within mpirun we use the following options: -machinefile machinefile.txt, -np, --byslot
The -machinefile option points to a simple text file listing all clients that execute the IOR benchmark
-np defines the total number of tasks to execute across all clients (e.g., -np 64 states 8 tasks per client with 8 clients)
--byslot defines how many tasks are executed on the first node before starting additional tasks on the second node, and so on. This is tied to how the machinefile options are defined
--bynode is another option, which executes 1 task per node before executing additional tasks per node.
Using MPI to execute IOR
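The --byslot versus --bynode distinction can be made concrete with a small simulation. This is illustrative only, not Open MPI itself; the node names and slot count are hypothetical.

```shell
#!/bin/sh
# Illustrative sketch: --byslot fills each node's slots before moving
# to the next node; --bynode round-robins one rank per node.
place_ranks() {  # $1=mode $2=np $3=slots-per-node, rest=node names
    mode=$1 np=$2 slots=$3; shift 3
    out="" count=0
    if [ "$mode" = byslot ]; then
        for node in "$@"; do
            i=0
            while [ $i -lt $slots ] && [ $count -lt $np ]; do
                out="$out $node"; count=$((count+1)); i=$((i+1))
            done
        done
    else  # bynode
        while [ $count -lt $np ]; do
            for node in "$@"; do
                [ $count -lt $np ] && { out="$out $node"; count=$((count+1)); }
            done
        done
    fi
    echo $out   # unquoted: collapses the leading space
}

place_ranks byslot 6 4 c1 c2   # c1 c1 c1 c1 c2 c2
place_ranks bynode 6 4 c1 c2   # c1 c2 c1 c2 c1 c2
```

With --byslot and a machinefile of slots=4 per node, the first four ranks land on the first client, which is why the machinefile order controls load placement.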

Create a simple file called 'machinefile.txt' listing all the clients with the following options:
slots=4
max_slots=<max number of CPU cores> (in this example, 16 cores)
Because we edited the /etc/hosts file with the client IP addresses, we only need to use the associated hostname for each client listed in /etc/hosts. This is also true if DNS is used at the customer site (no need to define node names in /etc/hosts), or one can use the IPv4 address in the machinefile.
Creating the machinefile on the head node

~]# vi machinefile.txt
client1 slots=4 max_slots=16
client2 slots=4 max_slots=16
client3 slots=4 max_slots=16
client4 slots=4 max_slots=16
…………
client13 slots=4 max_slots=16
client14 slots=4 max_slots=16
client15 slots=4 max_slots=16
client16 slots=4 max_slots=16
:wq!
With the Xyratex-defined machinefile and the --byslot option, we will use 4 slots on the first node, then 4 slots on the second node, and so on
If using --bynode, the slots are assigned round-robin across nodes regardless of the machinefile configuration
Slots = tasks per node
Sample machinefile for ClusterStor
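Rather than typing 16 near-identical lines in vi, the machinefile above can be generated. A minimal sketch; the hostname prefix and slot values are the example's assumptions:

```shell
#!/bin/sh
# Generate an Open MPI machinefile: one "host slots=N max_slots=M"
# line per client.
make_machinefile() {  # $1=prefix $2=count $3=slots $4=max_slots
    i=1
    while [ $i -le $2 ]; do
        printf '%s%d slots=%d max_slots=%d\n' "$1" "$i" "$3" "$4"
        i=$((i+1))
    done
}

make_machinefile client 16 4 16 > machinefile.txt
head -n 2 machinefile.txt   # client1 slots=4 max_slots=16 ...
```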

A typical IOR parameter set for 16 nodes with 32GB of memory is:
/usr/lib64/openmpi/bin/mpirun -machinefile machinefile.txt -np 128 --byslot ./IOR -v -F -t 1m -b 8g -o /mnt/lustre/test.`date +"%Y%m%d.%H%M%S"`
-np 128 = all 16 nodes used with 8 slots (tasks) per node
-b 8g = (2x32GB*16_clients)/128_tasks
If nodes are not uniform, use the lowest common denominator
-o /mnt/lustre/test.`date +"%Y%m%d.%H%M%S"`
Using a different output test file for each run was found to give better performance than reusing the same filename
-F: file per process
-t 1m: transfer size of 1m
-v: verbose output
Defining IOR Parameters
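The full command line above can be assembled from its parameters, which makes the -np/-b relationship explicit and gives each run a fresh output name. A sketch; the IOR path and mount point are assumptions for this example:

```shell
#!/bin/sh
# Assemble the mpirun/IOR command line from the slide's parameters.
NP=128                                        # 16 nodes x 8 tasks
BLOCK=8g                                      # (2 x 32GB x 16 clients) / 128 tasks
OUT="/mnt/lustre/test.$(date +%Y%m%d.%H%M%S)" # fresh output file per run
CMD="mpirun -machinefile machinefile.txt -np $NP --byslot ./IOR -v -F -t 1m -b $BLOCK -o $OUT"
echo "$CMD"
# To run it, replace the echo with: eval "$CMD"
```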

Jumbo frames give a >= 30% improvement in Lustre performance compared to the standard MTU of 1500
Change the MTU on the clients and servers to 9000
Change the MTU on the switches to 9214 (or the maximum MTU size) to accommodate payload overhead
Never set the MTU on the switch to the same value as the clients and servers
Ethernet Tuning
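A config sketch for the client/server side of the MTU change above. The interface name eth2 is an assumption; use the interface that carries LNET traffic:

```shell
# Set jumbo frames on a Lustre client or server NIC (interface name assumed).
ip link set dev eth2 mtu 9000                                   # immediate change
echo 'MTU=9000' >> /etc/sysconfig/network-scripts/ifcfg-eth2    # persist across reboot (EL6-style)
ip link show dev eth2 | grep -o 'mtu [0-9]*'                    # verify
# Switch ports stay at 9214 (or the platform maximum) so the 9000-byte
# payload plus Ethernet/VLAN headers still fits; switch CLI syntax varies by vendor.
```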

Writes and reads in a single operation: Direct IO
-F: file per process
-t 64m
-B: Direct IO instead of the default Buffered IO
--byslot option
The product of -np and -b sets the total transfer size; size it to achieve 6GB/s per SSU or better
Baseline IOR Performance for ClusterStor GridRAID

Use the -w -r flags in IOR for write and read results with Direct IO
-F: file per process
-t 64m
Write and read operation: -w -r
Direct IO: -B
--byslot option
The product of -np and -b sets the total transfer size; size it to achieve 6GB/s per SSU or better
First, ensure a Lustre stripe size of 1m and stripe count of 1:
lfs setstripe -s 1m -c 1 /mnt/lustre/fpp
Execute IOR for writes and reads:
mpirun -np 64 --byslot /bin/IOR -F -B -t 64mb -b 16g -w -r -o /mnt/lustre/fpp/testw.out
IOR Write/Read for GridRAID

Stonewall (-D) performs a write or read test for a fixed time in seconds, ensuring only maximum performance is measured
Removes unbalanced task completion that can affect performance results
Good to use when clients are of non-uniform architecture
If used, be sure to specify a much larger block size in IOR (-b) and a long enough time to write or read 2x client memory
Typically, increase -b in IOR by a factor of 4x or more
Advanced IOR Option: Stonewall

Use the -w flag in IOR for write-only results with Direct IO
-F: file per process
-t 64m
Write-only operation: -w
Direct IO: -B
-D 240 (4 min)
Keep the output files: -k
The product of -np and -b sets the total transfer size; size it to achieve 6GB/s per SSU or better
First, ensure a Lustre stripe size of 1m and stripe count of 1:
lfs setstripe -s 1m -c 1 /mnt/lustre/fpp
Execute IOR for writes:
mpirun -np 64 --byslot /bin/IOR -F -B -t 64mb -b 512g -w -k -o /mnt/lustre/fpp/testw.out -D 240
IOR Write for GridRAID w/ Stonewall

Use the -r flag in IOR for read-only results with Direct IO
-F: file per process
-t 64m
Read-only operation: -r
Direct IO: -B
-D 120 (2 min)
The product of -np and -b sets the total transfer size; size it to achieve 6GB/s per SSU or better
First, ensure a Lustre stripe size of 1m and stripe count of 1:
lfs setstripe -s 1m -c 1 /mnt/lustre/fpp
Execute IOR for reads:
mpirun -np 64 --byslot /bin/IOR -F -B -t 64mb -b 512g -r -o /mnt/lustre/fpp/testw.out -D 120
IOR Read for GridRAID w/ Stonewall

Training Systems

hvt-super00 super00 hvt-client001 (SL6.4, el6.x86_64, 128GB, 8 Cores)
hvt-super01 super01 hvt-client002 (SL6.4, el6.x86_64, 128GB, 8 Cores)
hvt-super02 super02 hvt-client00 (SL6.4, el6.x86_64, 128GB, 8 Cores)
hvt-super03 super03 hvt-client004 (SL6.4, el6.x86_64, 128GB, 8 Cores)
Group 1: Ron C. / Tony
hvt-asus100 asus100 hvt-client005 (SL6.4, el6.x86_64, 256GB, 24 Cores)
hvt-asus101 asus101 hvt-client006 (SL6.4, el6.x86_64, 256GB, 24 Cores)
hvt-asus102 asus102 hvt-client007 (SL6.4, el6.x86_64, 256GB, 24 Cores)
Group 2: Rex T / Bill L
hvt-asus103 asus103 hvt-client008 (SL6.4, el6.x86_64, 256GB, 24 Cores)
hvt-asus200 asus200 hvt-client009 (SL6.4, el6.x86_64, 256GB, 24 Cores)
hvt-asus201 asus201 hvt-client010 (SL6.4, el6.x86_64, 256GB, 24 Cores)
Group 3: Randy / Mike S
hvt-asus300 asus300 hvt-client013 (SL6.4, el6.x86_64, 256GB, 24 Cores)
hvt-asus301 asus301 hvt-client014 (SL6.4, el6.x86_64, 256GB, 24 Cores)
hvt-asus302 asus302 hvt-client016 (SL6.4, el6.x86_64, 256GB, 24 Cores)
hvt-asus303 asus303 hvt-client016 (SL6.4, el6.x86_64, 256GB, 24 Cores)
Group 4: Ron M / Dan N
Compute Clients

# ssh
~]# cscli fs_info
Information about "tsefs2" file system:
Node        Node type  Targets  Failover partner  Devices
tsesys2n02  mgs        0 / 0    tsesys2n03
tsesys2n03  mds        1 / 1    tsesys2n02        /dev/md66
tsesys2n04  oss        1 / 1    tsesys2n05        /dev/md0
tsesys2n05  oss        1 / 1    tsesys2n04        /dev/md1
tsesys2n06  oss        1 / 1    tsesys2n07        /dev/md0
tsesys2n07  oss        1 / 1    tsesys2n06        /dev/md1
~]# ssh tsesys2n02 'lctl list_nids'
Mount Command from Clients:
mount -t lustre
CS6000 GridRAID System

# ssh
~]# cscli fs_info
Information about "fs1" file system:
Node       Node type  Targets  Failover partner  Devices
hvt1sys02  mgs        0 / 0    hvt1sys03
hvt1sys03  mds        1 / 1    hvt1sys02         /dev/md66
hvt1sys04  oss        4 / 4    hvt1sys05         /dev/md0, /dev/md2, /dev/md4, /dev/md6
hvt1sys05  oss        4 / 4    hvt1sys04         /dev/md1, /dev/md3, /dev/md5, /dev/md7
~]# ssh hvt1sys02 'lctl list_nids'
Mount Command from Clients:
mount -t lustre /mnt/fs1
CS6000 MDRAID System

Always transfer 2x the total memory of all clients used to avoid any client-side caching effect
In our example:
Group 1: (4_nodes*128GB)*2 = 1024GB
Group 2: (3_nodes*256GB)*2 = 1536GB
Group 3: (3_nodes*256GB)*2 = 1536GB
Group 4: (4_nodes*256GB)*2 = 2048GB
This will be used to determine the IOR -b flag based on the number of tasks given in the mpirun -np flag
IOR Transfer Size
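The four group totals above, and the resulting -b values, follow from one formula: total GB = 2 x nodes x GB per node, then -b = total / np. A quick shell check using the lab numbers:

```shell
#!/bin/sh
# Verify the per-group transfer totals: 2 x nodes x GB-per-node.
transfer_gb() { echo $(( 2 * $1 * $2 )); }

transfer_gb 4 128                        # Group 1: 1024 GB
transfer_gb 3 256                        # Groups 2 and 3: 1536 GB
transfer_gb 4 256                        # Group 4: 2048 GB
echo $(( $(transfer_gb 4 128) / 32 ))    # Group 1 with -np 32: -b 32g
```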

Group 1
~]# vi machinefile.txt
super00 slots=4 max_slots=8
super01 slots=4 max_slots=8
super02 slots=4 max_slots=8
super03 slots=4 max_slots=8
Group 2
~]# vi machinefile.txt
asus100 slots=4 max_slots=24
asus101 slots=4 max_slots=24
asus102 slots=4 max_slots=24
Group 3
~]# vi machinefile.txt
asus103 slots=4 max_slots=24
asus201 slots=4 max_slots=24
asus202 slots=4 max_slots=24
Group 4
~]# vi machinefile.txt
asus300 slots=4 max_slots=24
asus301 slots=4 max_slots=24
asus302 slots=4 max_slots=24
Lab machinefile for Each Group

Before we run IOR, we want to confirm the configuration we just created.
For baselining the performance of each SSU using IOR, we need to make sure the stripe count for each directory is set to 1 and the stripe size is set to 1M
The command and output to do this are on the next slide.
For example, create a directory called benchmark under the Lustre mount point on a client:
# mkdir /mnt/lustre/benchmark
Set the Lustre stripe count to 1 and stripe size to 1m on a client:
# lfs setstripe -c 1 -s 1m /mnt/lustre/benchmark
Create and Confirm Lustre Stripe/Count from Client

Group 1
mpirun flags:
-np 32 = 4 clients will execute 8 tasks (slots) each
--byslot distribution
IOR flags:
-b 32g = we are transferring 1024GB total; 1024GB/32 = 32g
-o /mnt/lustre/test.`date +"%Y%m%d.%H%M%S"`
Using a different output test file for each run was found to give better performance than reusing the same filename
Or explicitly use -o /mnt/lustre/test.0
-F: file per process
-t 1m: transfer size of 1m
-w -r: write and read flags in IOR
-k: keep output files
Defining IOR Parameters for Group 1

Group 2
mpirun flags:
-np 24 = 3 clients will execute 8 tasks (slots) each
--byslot distribution
IOR flags:
-b 64g = we are transferring 1536GB total; 1536GB/24 = 64g
-o /mnt/lustre/test.`date +"%Y%m%d.%H%M%S"`
Using a different output test file for each run was found to give better performance than reusing the same filename
Or explicitly use -o /mnt/lustre/test.0
-F: file per process
-t 1m: transfer size of 1m
-w -r: write and read flags in IOR
-k: keep output files
Defining IOR Parameters for Group 2

Group 3
mpirun flags:
-np 24 = 3 clients will execute 8 tasks (slots) each
--byslot distribution
IOR flags:
-b 64g = we are transferring 1536GB total; 1536GB/24 = 64g
-o /mnt/lustre/test.`date +"%Y%m%d.%H%M%S"`
Using a different output test file for each run was found to give better performance than reusing the same filename
Or explicitly use -o /mnt/lustre/test.0
-F: file per process
-t 1m: transfer size of 1m
-w -r: write and read flags in IOR
-k: keep output files
Defining IOR Parameters for Group 3

Group 4
mpirun flags:
-np 32 = 4 clients will execute 8 tasks (slots) each
--byslot distribution
IOR flags:
-b 64g = we are transferring 2048GB total; 2048GB/32 = 64g
-o /mnt/lustre/test.`date +"%Y%m%d.%H%M%S"`
Using a different output test file for each run was found to give better performance than reusing the same filename
Or explicitly use -o /mnt/lustre/test.0
-F: file per process
-t 1m: transfer size of 1m
-w -r: write and read flags in IOR
-k: keep output files
Defining IOR Parameters for Group 4

The first parameters to change:
Disable wire checksums
Disable LRU
Increase max_rpcs_in_flight to 32 or 256 (1.8.9 or 2.4.x/2.5.1)
Increase max_dirty_mb to 128 or 256 (1.8.9 or 2.4.x/2.5.1)
Procedure to optimize Client Side Tuning

Disable client checksums with the specific FS name of cstorfs:
~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/checksums; do echo 0 > $n; done'
Increase max RPCs in flight from the default 8 to 32:
~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_rpcs_in_flight; do echo 32 > $n; done'
Disable LRU:
~]# pdsh -w client[1-12] 'lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))'
Increase max dirty MB from 32 to 128:
~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_dirty_mb; do echo 128 > $n; done'
NOTE: These settings are not persistent and will need to be reset if Lustre is remounted or the client is rebooted.
Client Lustre tuning Parameters for 1.8.9

Disable client checksums with the specific FS name of cstorfs:
~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/checksums; do echo 0 > $n; done'
Increase max RPCs in flight from the default 8 to 256:
~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_rpcs_in_flight; do echo 256 > $n; done'
Disable LRU:
~]# pdsh -w client[1-12] 'lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))'
Increase max dirty MB from 32 to 256:
~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_dirty_mb; do echo 256 > $n; done'
NOTE: These settings are not persistent and will need to be reset if Lustre is remounted or the client is rebooted.
Client Lustre tuning Parameters for 2.4.x/2.5.1

Check the checksum algorithm on the client:
~]# pdsh -w client[1-12] 'cat /proc/fs/lustre/osc/cstorfs-OST00*/checksum_type'
The default is adler, but it can be changed to crc32c:
~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/checksum_type; do echo crc32c > $n; done'
Checksum Algorithm on Client

~]# cat clientprep.sh
#!/bin/bash
echo "Mount Lustre"
pdsh -w client[1-12] 'mount -t lustre /mnt/lustre'
sleep 3
echo "Disable Client Checksums"
pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/checksums; do echo 0 > $n; done'
sleep 3
echo "Increase Max RPCs in Flight from 8 to 32"
pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_rpcs_in_flight; do echo 32 > $n; done'
sleep 3
echo "Increase Max Dirty MB from 32 to 128"
pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_dirty_mb; do echo 128 > $n; done'
sleep 3
echo "Disable LRU"
pdsh -w client[1-12] 'lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))'
sleep 3
echo "Done Prepping Lustre Clients[1-12] for Benchmarking"
Script to Mount and Tune Clients for 1.8.9

~]# cat clientprep.sh
#!/bin/bash
echo "Mount Lustre"
pdsh -w client[1-12] 'mount -t lustre /mnt/lustre'
sleep 3
echo "Disable Client Checksums"
pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/checksums; do echo 0 > $n; done'
sleep 3
echo "Increase Max RPCs in Flight from 8 to 256"
pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_rpcs_in_flight; do echo 256 > $n; done'
sleep 3
echo "Increase Max Dirty MB from 32 to 256"
pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_dirty_mb; do echo 256 > $n; done'
sleep 3
echo "Disable LRU"
pdsh -w client[1-12] 'lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))'
sleep 3
echo "Done Prepping Lustre Clients[1-12] for Benchmarking"
Script to Mount and Tune Clients for 2.4.x/2.5.x

References

Benchmarking Best Practices
Rice Oil and Gas Talk: Tuning and Measuring Performance
LUG 2014 Client Performance Results
IEEE Paper, Torben Kling-Petersen and John Fragalla, on Optimizing Performance for a HPC Storage System
Neo Performance Folder
References

Training Module: Client Connectivity/Configuration Benchmarking Thank You

Server Lustre Configuration Server Side Storage Pool Configuration

On the client, use lsof to find the processes still attached to the Lustre mount point:
# lsof | grep /mnt/lustre
Kill the process, using -9 if needed:
# kill -9
Can't umount Lustre on Client

Listing all storage pools can be done by running a command from the MDS node.
~]$ lctl dl
0 UP mgc c857b2ef-f624-e14a-9fb1-8ad7525e4fe4 5
1 UP lov cstorfs-MDT0000-mdtlov cstorfs-MDT0000-mdtlov_UUID 4
2 UP mdt cstorfs-MDT0000 cstorfs-MDT0000_UUID 3
3 UP mds mdd_obd-cstorfs-MDT0000 mdd_obd_uuid-cstorfs-MDT0000 3
4 UP osc cstorfs-OST0015-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
5 UP osc cstorfs-OST000d-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
6 UP osc cstorfs-OST0005-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
7 UP osc cstorfs-OST0012-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
8 UP osc cstorfs-OST000a-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
9 UP osc cstorfs-OST0002-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
10 UP osc cstorfs-OST0016-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
11 UP osc cstorfs-OST000e-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
12 UP osc cstorfs-OST0006-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
13 UP osc cstorfs-OST0013-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
14 UP osc cstorfs-OST000b-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
Storage Pools: List all OSTs

From the output on the MDS node, we see the following OST assignment per SSU:
SSU1: OST[0000-0007]
SSU2: OST[0008-000f]
SSU3: OST[0010-0017]
Storage Pools: OSTs per SSU

Storage pools are created on the node where the MGS target is mounted, which is primarily node 02.
We will create 3 storage pools, 1 per SSU, with the following commands as root:
~]# lctl pool_new cstorfs.ssu1
~]# lctl pool_new cstorfs.ssu2
~]# lctl pool_new cstorfs.ssu3
~]# lctl pool_add cstorfs.ssu1 OST[0000-0007]
~]# lctl pool_add cstorfs.ssu2 OST[0008-000f]
~]# lctl pool_add cstorfs.ssu3 OST[0010-0017]
We will see output similar to this:
Pool cstorfs.ssu3 not found
Can't verify pool command since there is no local MDT or client, proceeding anyhow...
This is OK because the MGS is not mounted with the MDT. If the MGS and MDT were mounted on the same node, we would not see the above warnings.
Storage Pools: Creation
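The pool-per-SSU pattern is regular enough to generate. A dry-run sketch that only prints the lctl commands (the 8-OSTs-per-SSU layout, the filesystem name, and the fsname-prefixed OST list form are assumptions here; nothing is executed against the MGS):

```shell
#!/bin/sh
# Print the lctl pool commands for N SSUs of 8 OSTs each.
gen_pool_cmds() {  # $1=fsname $2=number-of-SSUs
    fs=$1; ssus=$2; per=8; i=0
    while [ $i -lt $ssus ]; do
        lo=$(printf '%04x' $((i * per)))            # first OST index, hex
        hi=$(printf '%04x' $((i * per + per - 1)))  # last OST index, hex
        printf 'lctl pool_new %s.ssu%d\n' "$fs" $((i + 1))
        printf 'lctl pool_add %s.ssu%d %s-OST[%s-%s]\n' "$fs" $((i + 1)) "$fs" "$lo" "$hi"
        i=$((i + 1))
    done
}

gen_pool_cmds cstorfs 3
```

Reviewing the printed commands before piping them into a root shell on the MGS node avoids typos in the hex OST ranges.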

First, remove the OSTs from the storage pools that were created on the MGS node:
~]# lctl pool_list cstorfs.ssu1
~]# lctl pool_list cstorfs.ssu2
~]# lctl pool_list cstorfs.ssu3
~]# lctl pool_remove cstorfs.ssu1 OST[0000-0007]
~]# lctl pool_remove cstorfs.ssu2 OST[0008-000f]
~]# lctl pool_remove cstorfs.ssu3 OST[0010-0017]
Destroy (delete) the storage pools:
~]# lctl pool_destroy cstorfs.ssu1
~]# lctl pool_destroy cstorfs.ssu2
~]# lctl pool_destroy cstorfs.ssu3
Storage Pools: Deletion

Server Lustre Configuration Client Side Storage Pool Configuration

Pick any client with Lustre mounted; first we need to change directory into the Lustre mount point:
lustre]# cd /mnt/lustre
List the storage pools we just created on the MGS node from this client:
lustre]# lctl pool_list cstorfs
Pools from cstorfs:
cstorfs.ssu3
cstorfs.ssu2
cstorfs.ssu1
Check the OST assignment of one of the storage pools:
lustre]# lctl pool_list cstorfs.ssu1
Pool: cstorfs.ssu1
cstorfs-OST0000_UUID
cstorfs-OST0001_UUID
cstorfs-OST0002_UUID
cstorfs-OST0003_UUID
cstorfs-OST0004_UUID
cstorfs-OST0005_UUID
cstorfs-OST0006_UUID
cstorfs-OST0007_UUID
Configuring Lustre to use Storage Pools

Still in the /mnt/lustre directory on the same Lustre client, create 3 subdirectories that we will assign to each storage pool:
lustre]# mkdir ssu1
lustre]# mkdir ssu2
lustre]# mkdir ssu3
Assign the storage pools we created to each directory with a stripe count of 1 and stripe size of 1m:
lustre]# lfs setstripe -p cstorfs.ssu1 -c 1 -s 1m /mnt/lustre/ssu1
lustre]# lfs setstripe -p cstorfs.ssu2 -c 1 -s 1m /mnt/lustre/ssu2
lustre]# lfs setstripe -p cstorfs.ssu3 -c 1 -s 1m /mnt/lustre/ssu3
Assigning a Storage Pool to a Sub Directory

Before we run IOR, we want to confirm the configuration we just created.
For baselining the performance of each SSU using IOR, we need to make sure the stripe count for each directory is set to 1 and the stripe size is set to 1M
The command and output to do this are on the next slide.
Confirm Lustre Stripe/Count and Pools

lustre]# lfs getstripe /mnt/lustre
/mnt/lustre
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1
/mnt/lustre/ssu3
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1 pool: ssu3
/mnt/lustre/ssu1
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1 pool: ssu1
/mnt/lustre/ssu2
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1 pool: ssu2
Confirm Lustre Stripe/Count and Pools

Server Lustre Configuration Creating Storage Pools for Testing

We need to create Lustre storage pools because 8 clients cannot stress our 3-SSU test system.
Storage pools allow us to test each SSU individually by directing the IOR output to a storage pool, so we can baseline each SSU on its own.
Configuring storage pools is a combination of MGS-side and client-side configuration.
NOTE: This is typically not needed for baseline performance at customer sites.
Lustre Storage Pools