
1 Client Configuration Lustre Benchmarking

2 Client Setup and Package Installation Client Lustre Configuration Client Tuning Parameters Lustre Striping Benchmarking Baselining for MDRAID obdfilter-survey IOR Agenda

3 Client Configuration for SSH Setting up ssh keys

4 All clients must have a functioning SSH server that allows both direct root access and key-based authentication. Generate a master key on the head node, then copy this key into the ~/.ssh/authorized_keys file on each client. Required SSH keys on Clients

5 Login to the Client head node and generate the master key:
# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
8c:d1:39:a7:68:a3:e1:5f:d9:95:b3:e6:13:6a:8e:cc root@fvt-client1.xyus.xyratex.com
Read the generated public key from the head node:
# cat ~/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAwZsS68UMWSXaybwAxnaHq30VIL0uM54VVgiJmTLZQ/qFhH0/GP6WTSUPk5U/eiRRc1Lhfp7AY3VWdKQ2wv084EMC+9uPuFht9ugOPaPI4yVFYskZ+NNYKb6v07hGW10wD25jMPZ/omxsVx1cHt25KlDc+FA2Wj1mxK6x61vQayPxQh4WFHhCgM30TsllrAB9SHh37+ookHTeY8xpQpbunRGCyBrRFqVLcusnho4P5zZrtSrKlPLjKIy1kg43hVgzSk6ae5FVSvaYQmubQb1Q31ftrwne7zqCLjfhudkgsETBDJtteWZPFUpRZYpbtvOkfCqa/XiSrOY8Xc/nxq0Dvw== root@fvt-client1.xyus.xyratex.com
How to setup SSH keys on the Client

6 Logon to each client, make the .ssh folder, and copy the public key generated on the head node to each client. For example, looking at a sample configuration for one client, client2:
[root@fvt-client1 ~]# ssh client2
root@client2's password:
Last login: Fri May 17 01:52:53 2013 from 10.106.54.18
[root@fvt-client2 ~]# mkdir -m 0700 ~/.ssh
[root@fvt-client2 ~]# cat >> ~/.ssh/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAwZsS68UMWSXaybwAxnaHq30VIL0uM54VVgiJmTLZQ/qFhH0/GP6WTSUPk5U/eiRRc1Lhfp7AY3VWdKQ2wv084EMC+9uPuFht9ugOPaPI4yVFYskZ+NNYKb6v07hGW10wD25jMPZ/omxsVx1cHt25KlDc+FA2Wj1mxK6x61vQayPxQh4WFHhCgM30TsllrAB9SHh37+ookHTeY8xpQpbunRGCyBrRFqVLcusnho4P5zZrtSrKlPLjKIy1kg43hVgzSk6ae5FVSvaYQmubQb1Q31ftrwne7zqCLjfhudkgsETBDJtteWZPFUpRZYpbtvOkfCqa/XiSrOY8Xc/nxq0Dvw== root@fvt-client1.xyus.xyratex.com
[root@fvt-client2 ~]# exit
Use the key to configure all clients
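
As a shortcut, the same key distribution can be scripted from the head node with ssh-copy-id; a minimal sketch, assuming the clients are named client1 through client12 and root password login is still enabled:
# for i in $(seq 2 12); do ssh-copy-id -i ~/.ssh/id_rsa.pub root@client$i; done
# ssh client2 hostname
(should now return the hostname without a password prompt)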

7 Client Configuration for pdsh Setting up pdsh

8 Install pdsh on the head node:
# yum install -y pdsh
/etc/hosts is already configured with the client hostnames for you.
How to use pdsh for each group to confirm SSH is configured correctly, e.g. Group 1:
# pdsh -w super[00-03] date
Installing and using pdsh
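
When the same command is run on many clients, piping pdsh output through dshbak (shipped with pdsh) collapses identical responses and makes stragglers obvious; a minimal sketch, assuming the client[1-12] naming used elsewhere in this module:
# pdsh -w client[1-12] 'uname -r' | dshbak -c
(clients that print a different kernel version, or none at all, show up as separate groups)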

9 Client Installation of MPI

10 OpenMPI is needed to execute IOR on all clients using the command 'mpirun'. From the head node, run the following commands:
[root@fvt-client1 .ssh]# pdsh -w client[1-12] 'yum install -y openmpi openmpi-devel'
Add openmpi to your path:
[root@fvt-client1 .ssh]# pdsh -w client[1-12] 'ldconfig /usr/lib64/openmpi/lib'
[root@fvt-client1 .ssh]# pdsh -w client[1-12] 'export PATH=$PATH:/usr/lib64/openmpi/bin:/usr/lib64/openmpi/lib'
Check the path before continuing:
[root@fvt-client1 .ssh]# pdsh -w client[1-12] 'echo $PATH'
If the path is not correct, you may need to edit the shell startup file (for example .bashrc), add the exports, source the file, then copy it to all clients:
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib
Install openmpi on all nodes
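
One hedged way to make those environment settings persistent on every client is a small profile script pushed out with pdcp; a minimal sketch, assuming a hypothetical file /etc/profile.d/openmpi.sh, the client[1-12] naming, and that pdcp is installed alongside pdsh:
[root@fvt-client1 ~]# cat > /etc/profile.d/openmpi.sh <<'EOF'
export PATH=$PATH:/usr/lib64/openmpi/bin
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib
EOF
[root@fvt-client1 ~]# pdcp -w client[2-12] /etc/profile.d/openmpi.sh /etc/profile.d/
[root@fvt-client1 ~]# pdsh -w client[1-12] 'bash -lc "which mpirun"'
(every client should report /usr/lib64/openmpi/bin/mpirun)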

11 Client InfiniBand Installation Stock OFED

12 InfiniBand is essential to run IOR using RDMA over IB. From the head node, run the following commands to install IB:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'yum groupinstall -y "Infiniband Support"'
[root@fvt-client1 ~]# pdsh -w client[1-12] 'yum install -y infiniband-diags'
Start RDMA and bring up the ib0 interface:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'service rdma start'
[root@fvt-client1 ~]# pdsh -w client[1-12] 'ifup ib0'
Install the OFED/InfiniBand Packages
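
Before moving on, it is worth confirming every HCA port is up; a minimal hedged check using ibstat from infiniband-diags (installed above), assuming the client[1-12] naming:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'ibstat | grep -E "State|Rate"' | dshbak -c
(every client should report State: Active and the expected link rate; ping the ib0 addresses afterwards to confirm IPoIB connectivity)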

13 Client Download and Installation of IOR

14 From the head node, download and build IOR. To build IOR, install the make and autoconf packages plus the Development Tools group (which provides the compiler toolchain):
[root@fvt-client1 ~]# yum install -y make autoconf
[root@fvt-client1 ~]# yum groupinstall -y "Development Tools"
Download the IOR tool:
[root@fvt-client1 ~]# wget http://downloads.sourceforge.net/project/ior-sio/IOR%20latest/IOR-2.10.3/IOR-2.10.3.tgz
Download and build IOR
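
The IOR Makefile invokes mpicc, so the OpenMPI wrapper compiler has to be on PATH in the build shell; a small hedged check before running make, assuming the stock RHEL/SL6 openmpi layout:
[root@fvt-client1 ~]# export PATH=$PATH:/usr/lib64/openmpi/bin
[root@fvt-client1 ~]# which mpicc
/usr/lib64/openmpi/bin/mpicc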

15 IOR is downloaded to the local directory. Untar/ungzip it and run the make utility to build IOR:
[root@fvt-client1 ~]# tar -zxvf IOR-2.10.3.tgz
Go to the IOR directory that was untarred/unzipped and build:
[root@fvt-client1 IOR]# make
(cd ./src/C && make posix)
make[1]: Entering directory `/root/IOR/src/C'
mpicc -g -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -c IOR.c
mpicc -g -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -c utilities.c
mpicc -g -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -c parse_options.c
mpicc -g -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -c aiori-POSIX.c
mpicc -g -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -c aiori-noMPIIO.c
mpicc -g -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -c aiori-noHDF5.c
mpicc -g -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -c aiori-noNCMPI.c
mpicc -o IOR IOR.o utilities.o parse_options.o aiori-POSIX.o aiori-noMPIIO.o aiori-noHDF5.o aiori-noNCMPI.o -lm
make[1]: Leaving directory `/root/IOR/src/C'
Copy the resulting binary into place:
[root@fvt-client1 ~]# cp /root/IOR/src/C/IOR .
Download and build IOR

16 Copy IOR to all clients:
[root@fvt-client1 ~]# pdcp -w client[2-12] IOR /root/.
Confirm IOR is linked with the correct MPI library:
[root@fvt-client1 ~]# pdsh -w client[1-12] "ldd IOR | grep mpi"
client5: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x0000003b45800000)
client7: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x00000031e6c00000)
client8: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x000000375a400000)
client6: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x00000038e7200000)
client1: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x0000003ebee00000)
client3: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x00000033dbe00000)
client2: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x0000003fe0000000)
client11: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x00007fa077379000)
client12: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x00007f8c5bf6e000)
client4: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x00000030a8400000)
client9: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x00007f8f658e3000)
client10: libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x00007f475a019000)
Download and build IOR

17 Lustre Client Installation Installing Lustre Client RPMs

18 Two packages are required to be installed or built on the clients:
lustre-client-modules-<version>.rpm -- Lustre patchless client modules
lustre-client-<version>.rpm -- Lustre utilities
The correct RPMs to install at the customer site need to be confirmed through the site survey, working with the ClusterStor Support organization.
If you don't have a pre-built Lustre client for your particular client OS and kernel, then you will have to build the client using the SRC RPM package.
Download clients from: http://downloads.whamcloud.com/public/lustre/
Client Lustre Packages

19 Just to make sure: unmount Lustre (if already mounted):
# umount /mnt/lustre
Unload the Lustre modules:
# lustre_rmmod
Install kernel development packages and compilers:
[root@hvt-super00 ]# yum install -y kernel-devel libselinux-devel rpm-build
[root@hvt-super00 ]# yum groupinstall -y "Development Tools"
Download the Lustre SRC RPM and install it:
# rpm -ivh --nodeps lustre-client-2.4.3-2.6.32_358.23.2.el6.x86_64.src.rpm
1:lustre-client warning: user jenkins does not exist - using root
warning: group jenkins does not exist - using root
########################################### [100%]
warning: user jenkins does not exist - using root
warning: group jenkins does not exist - using root
The above output only indicates that the rpmbuild directory for this client is under /root; it is just a warning.
Building Lustre Clients

20 Go to the SOURCES directory and unpack the tarball:
[root@hvt-super00 SOURCES]# cd /root/rpmbuild/SOURCES/
[root@hvt-super00 SOURCES]# ls
lustre-2.4.3.tar.gz
[root@hvt-super00 SOURCES]# gunzip lustre-2.4.3.tar.gz
[root@hvt-super00 SOURCES]# tar xvf lustre-2.4.3.tar
cd into the unpacked directory:
[root@hvt-super00 SOURCES]# cd lustre-2.4.3
[root@hvt-super00 lustre-2.4.3]# pwd
/root/rpmbuild/SOURCES/lustre-2.4.3
From this directory, run the following commands to build the RPMs for your specific kernel, assuming the OS stock OFED:
# make distclean
# ./configure --disable-server --with-linux=/usr/src/kernels/2.6.32-358.el6.x86_64
# make && make rpms
Building Lustre Clients

21 The RPMs just built can be found in the following directory:
/root/rpmbuild/RPMS/x86_64
Two packages are required to be installed on the clients:
lustre-client-modules-<version>.rpm -- Lustre patchless client modules
lustre-client-<version>.rpm -- Lustre utilities
Copy the built RPMs to the other clients and install them, provided all clients run the same OS and kernel:
# rpm -ivh lustre-client-modules-2.4.3-2.6.32_358.el6.x86_64.x86_64.rpm
# rpm -ivh lustre-client-2.4.3-2.6.32_358.el6.x86_64.x86_64.rpm
NOTE: 2.4.3 can be 1.8.9 or 2.5.1
Install the Lustre Client RPMs
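
A hedged sketch of pushing and installing those RPMs in one pass with pdcp/pdsh, assuming the client[1-12] naming, that every client runs the same kernel, and that the exact RPM file names will differ with your build:
[root@fvt-client1 ~]# pdcp -w client[2-12] /root/rpmbuild/RPMS/x86_64/lustre-client-*.rpm /root/
[root@fvt-client1 ~]# rpm -ivh /root/rpmbuild/RPMS/x86_64/lustre-client-*.rpm
[root@fvt-client1 ~]# pdsh -w client[2-12] 'rpm -ivh /root/lustre-client-*.rpm'
[root@fvt-client1 ~]# pdsh -w client[1-12] 'rpm -qa | grep lustre-client' | dshbak -c
(every client should report the same two lustre-client packages)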

22 Edit or create /etc/modprobe.d/lnet.conf:
options lnet networks="o2ib0(ib0)" (for IB nodes)
options lnet networks="tcp(eth20)" (for Ethernet nodes)
Install the two Lustre client RPMs just built, then start Lustre:
# modprobe lustre
On an IB system you may also need to start RDMA and bring up ib0:
# service rdma start
# ifup ib0
Configuration and Starting Lustre on Clients
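
A hedged sketch of pushing that file to every client and checking that each client's LNET NID comes up on the expected network, assuming the IB variant and the client[1-12] naming:
[root@fvt-client1 ~]# pdcp -w client[2-12] /etc/modprobe.d/lnet.conf /etc/modprobe.d/
[root@fvt-client1 ~]# pdsh -w client[1-12] 'modprobe lustre && lctl network up && lctl list_nids'
(each client should report exactly one NID on the o2ib0 network)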

23 # ssh admin@10.0.159.56
[root@tsesys2n00 ~]# cscli fs_info
Information about "tsefs2" file system:
Node        Node type  Targets  Failover partner  Devices
tsesys2n02  mgs        0 / 0    tsesys2n03
tsesys2n03  mds        1 / 1    tsesys2n02        /dev/md66
tsesys2n04  oss        1 / 1    tsesys2n05        /dev/md0
tsesys2n05  oss        1 / 1    tsesys2n04        /dev/md1
tsesys2n06  oss        1 / 1    tsesys2n07        /dev/md0
tsesys2n07  oss        1 / 1    tsesys2n06        /dev/md1
[root@tsesys2n00 ~]# ssh tsesys2n02 'lctl list_nids'
172.21.2.3@o2ib
Mount command from clients:
mount -t lustre 172.21.2.3@o2ib:172.21.2.4@o2ib:/tsefs2 /mnt/tsefs2
CS6000 GridRAID System

24 # ssh admin@10.0.159.12
[root@hvt1sys00 ~]# cscli fs_info
Information about "fs1" file system:
Node       Node type  Targets  Failover partner  Devices
hvt1sys02  mgs        0 / 0    hvt1sys03
hvt1sys03  mds        1 / 1    hvt1sys02         /dev/md66
hvt1sys04  oss        4 / 4    hvt1sys05         /dev/md0, /dev/md2, /dev/md4, /dev/md6
hvt1sys05  oss        4 / 4    hvt1sys04         /dev/md1, /dev/md3, /dev/md5, /dev/md7
[root@hvt1sys00 ~]# ssh hvt1sys02 'lctl list_nids'
172.18.1.4@o2ib
Mount command from clients:
mount -t lustre 172.18.1.4@o2ib:172.18.1.5@o2ib:/fs1 /mnt/fs1
CS6000 MDRAID System

25 To find the InfiniBand IP address and LNET name needed to mount Lustre on the clients, log on to the MGS node that has the MGT mounted and run the following command:
[root@mgs ~]# lctl list_nids
172.18.1.3@o2ib0
172.18.1.3@tcp
172.18.1.3 is the IP address of ib0 on the MGS node (the MGS and MDS can run on the same node or on different nodes); o2ib0 is the default LNET network for RDMA over IB.
Good practice is to list both the primary and secondary MGS NIDs in the mount option, so the clients can still access the filesystem if the MGS target fails over from the primary to the secondary node:
mount -t lustre 172.18.1.3@o2ib0:172.18.1.4@o2ib0:/fsname /mnt/lustre
Mount Lustre on the Clients

26 We first need to create the mount point, then issue the mount command:
[root@client~]# pdsh -w client[1-200] 'mkdir /mnt/lustre'
[root@client~]# pdsh -w client[1-200] 'mount -t lustre 172.18.1.3@o2ib0:172.18.1.4@o2ib0:/fsname /mnt/lustre'
Check that all clients mounted successfully:
[root@client~]# pdsh -w client[1-200] 'mount -t lustre' | wc -l
200
Check the state of the filesystem from one client with the following command. With 36 OSS servers, the output should show 144 OSTs:
[root@fvt-client1 ~]# lfs check servers | grep OST | wc -l
146
Example of Mounting Lustre on All Clients
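
Another quick hedged sanity check is lfs df, which lists every OST the client can reach along with the filesystem capacity; run it across all clients and collapse identical output:
[root@fvt-client1 ~]# pdsh -w client[1-200] 'lfs df -h /mnt/lustre | tail -n 1' | dshbak -c
(the filesystem_summary line should report the same total capacity on every client)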

27 Client Lustre Tuning

28 Network Checksums: The default is on, which impacts performance. Disabling checksums is the first thing we do for performance.
LRU Size: Parameter used to control the number of client-side locks in an LRU queue. Typically we disable this parameter.
Max RPCs in Flight: Default is 8. RPC is remote procedure call; this tunable is the maximum number of concurrent RPCs in flight from the clients.
Max Dirty MB: Default is 32; a good rule of thumb is 4x the value of max_rpcs_in_flight. Defines how many MB of dirty data can be written and queued up on the client.
Client Lustre Parameters

29 The first thing to always do is disable wire checksums on the client and disable LRU.
max_rpcs_in_flight and max_dirty_mb are a product of the number of clients available for benchmarks. Typically we increase max_rpcs_in_flight to 32 for 1.8.9 clients, and to 256 for 2.4.x/2.5.x clients.
In some cases, if we still don't get performance, we then increase max_dirty_mb to 4x the current value for 1.8.9 clients, or to the same value as max_rpcs_in_flight for 2.4.x/2.5.x clients.
Procedure for benchmarking:
1. Disable checksums
2. Disable LRU
3. Increase max_rpcs_in_flight for the specific client
4. Increase max_dirty_mb for the specific client
Procedure to optimize Client Side Tuning

30 Disable client checksums, with the specific FS name of cstorfs:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/checksums; do echo 0 > $n; done'
Increase max RPCs in flight from the default 8 to 32:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_rpcs_in_flight; do echo 32 > $n; done'
Disable LRU:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))'
Increase max dirty MB from 32 to 128:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_dirty_mb; do echo 128 > $n; done'
NOTE: These settings are not persistent and will need to be reset if Lustre is re-mounted or the client is rebooted.
1.8.9 Client Lustre Tuning Parameters

31 Disable client checksums, with the specific FS name of cstorfs:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/checksums; do echo 0 > $n; done'
Increase max RPCs in flight from the default 8 to 256:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_rpcs_in_flight; do echo 256 > $n; done'
Disable LRU:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))'
Increase max dirty MB from 32 to 256:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_dirty_mb; do echo 256 > $n; done'
NOTE: These settings are not persistent and will need to be reset if Lustre is re-mounted or the client is rebooted.
2.4.x/2.5.x Client Lustre Tuning Parameters
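
The same tunables can also be set through lctl set_param instead of echoing into /proc; a hedged equivalent sketch for the 2.4.x values, assuming the filesystem is named cstorfs (these are still non-persistent):
[root@fvt-client1 ~]# pdsh -w client[1-12] 'lctl set_param osc.cstorfs-OST00*.checksums=0'
[root@fvt-client1 ~]# pdsh -w client[1-12] 'lctl set_param osc.cstorfs-OST00*.max_rpcs_in_flight=256'
[root@fvt-client1 ~]# pdsh -w client[1-12] 'lctl set_param osc.cstorfs-OST00*.max_dirty_mb=256'
Verify the values were applied:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'lctl get_param osc.cstorfs-OST00*.max_rpcs_in_flight' | dshbak -c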

32 Based on the LUG 2014 client performance comparison, the surprising result is that keeping checksums enabled has only up to about a 5% impact on performance.
The default algorithm is Adler32, but CRC32 is also available; CRC32 is suggested because current CPU technologies provide hardware acceleration for it.
Enable client checksums, with the specific FS name of cstorfs:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/checksums; do echo 1 > $n; done'
Select the CRC32 client checksum type, with the specific FS name of cstorfs:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/checksum_type; do echo crc32 > $n; done'
A Note on Checksums

33 Jumbo frames give a >= 30% improvement in Lustre performance compared to the standard MTU of 1500.
Change the MTU on the clients and servers to 9000.
Change the MTU on the switches to 9214 (or the maximum MTU size) to accommodate payload overhead.
Never set the switch MTU to the same value as the clients and servers.
Ethernet Tuning

34 Server Side Benchmark for MDRAID Using obdfilter-survey

35 obdfilter-survey is a Lustre benchmark tool that measures OSS and back-end OST performance; it does not measure LNET or client performance.
This makes it a good benchmark for isolating the network and clients from the servers.
It is run from the primary management node, and you must be root to execute obdfilter-survey on the OSS nodes.
Server Side Benchmark

36 Before running obdfilter-survey, make sure all targets are mounted on their primary servers.
CS6000 health:
[root@lmtest400 ~]# pdsh -g lustre 'grep -c lustre /proc/mounts' | dshbak -c
----------------
lmtest[402-403]
----------------
1
----------------
lmtest[404-409]
----------------
4
If the output differs from the above, use HA to failover/failback resources to their primary servers before proceeding.
Check CS6000 Configuration for Health

37 obdfilter-survey parameters:
nobjlo=1 nobjhi=1 thrlo=256 thrhi=256 size=65536
size=65536: file size, which is 2x the controller memory
nobjhi=1 nobjlo=1: number of files
thrhi=256 thrlo=256: number of worker threads when testing an OSS and SSU
The results for each OSS should be in the range of 3000 MB/s on writes and 3500 MB/s on reads.
If you see significantly lower results, rerun the test multiple times to ensure the anomalies are not consistent.
NOTE: obdfilter-survey is intrusive, must be run as root, and can occasionally induce an LBUG on the OSS node; don't be alarmed.
obdfilter-survey Setup for CS6000 MDRAID

38 [root@lmtest400 ~]# pdsh -g oss 'nobjlo=1 nobjhi=1 thrlo=256 thrhi=256 size=65536 obdfilter-survey'
lmtest408: Sun May 19 17:01:47 PDT 2013 Obdfilter-survey for case=disk from lmtest408
lmtest409: Sun May 19 17:01:47 PDT 2013 Obdfilter-survey for case=disk from lmtest409
lmtest404: Sun May 19 17:01:47 PDT 2013 Obdfilter-survey for case=disk from lmtest404
lmtest406: Sun May 19 17:01:47 PDT 2013 Obdfilter-survey for case=disk from lmtest406
lmtest407: Sun May 19 17:01:48 PDT 2013 Obdfilter-survey for case=disk from lmtest407
lmtest405: Sun May 19 17:01:47 PDT 2013 Obdfilter-survey for case=disk from lmtest405
lmtest406: ost 4 sz 134217728K rsz 1024K obj 4 thr 1024 write 3271.13 [ 670.75, 953.31] rewrite 3224.33 [ 664.84, 966.80] read 3840.65 [ 647.85,1228.86]
lmtest409: ost 4 sz 134217728K rsz 1024K obj 4 thr 1024 write 3151.11 [ 557.89, 910.83] rewrite 3130.93 [ 649.86, 911.84] read 4004.42 [ 966.89,1040.84]
lmtest408: ost 4 sz 134217728K rsz 1024K obj 4 thr 1024 write 3131.36 [ 574.93, 926.86] rewrite 3127.69 [ 585.89, 923.82] read 4016.74 [ 965.76,1053.87]
lmtest407: ost 4 sz 134217728K rsz 1024K obj 4 thr 1024 write 3258.88 [ 607.92, 941.74] rewrite 3159.85 [ 669.24, 909.82] read 3766.40 [ 753.85,1233.84]
lmtest406: done!
lmtest408: done!
lmtest409: done!
lmtest407: done!
lmtest405: ost 4 sz 134217728K rsz 1024K obj 4 thr 1024 write 3121.14 [ 583.42, 920.82] rewrite 2967.64 [ 618.92, 902.00] read 3605.18 [ 769.83,1207.71]
lmtest404: ost 4 sz 134217728K rsz 1024K obj 4 thr 1024 write 3119.20 [ 607.75, 916.85] rewrite 2951.78 [ 539.93, 913.83] read 3560.45 [ 505.90,1231.84]
lmtest404: done!
lmtest405: done!
obdfilter-survey Results for CS6000 MDRAID with 256 worker threads

39 Server Side Benchmark for GridRAID Using obdfilter-survey

40 obdfilter-survey is a Lustre benchmark tool that measures OSS and back-end OST performance; it does not measure LNET or client performance.
This makes it a good benchmark for isolating the network and clients from the servers.
It is run from the primary management node, and you must be root to execute obdfilter-survey on the OSS nodes.
Server Side Benchmark

41 Before running obdfilter-survey, make sure all targets are mounted on their primary servers.
CS9000 health:
[root@lmtest400 ~]# pdsh -g lustre 'grep -c lustre /proc/mounts' | dshbak -c
----------------
lmtest[402-403]
----------------
1
----------------
lmtest[404-409]
----------------
1
If the output differs from the above, use HA to failover/failback resources to their primary servers before proceeding.
Check ClusterStor Configuration for Health

42 The pre-allocation size of LDISKFS is not set correctly for GridRAID and needs to be changed for optimal performance. NOTE: This is fixed in ClusterStor 1.5.
First, check the pre-allocation size on each OSS:
[root@tsesys2n00 ~]# pdsh -g oss 'cat /proc/fs/ldiskfs/md*/prealloc_table'
tsesys2n04: 32 64 128
tsesys2n06: 32 64 128
tsesys2n05: 32 64 128
tsesys2n07: 32 64 128
To correct the pre-allocation size:
[root@tsesys2n00 ~]# pdsh -g oss 'echo "256 512 1024 2048 4096" > /proc/fs/ldiskfs/md*/prealloc_table'
ClusterStor 1.4 Tuning Change Required
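
A quick hedged re-check that the new table took effect; since this is a /proc setting it will likely need to be reapplied after an OSS reboot or failover/failback:
[root@tsesys2n00 ~]# pdsh -g oss 'cat /proc/fs/ldiskfs/md*/prealloc_table' | dshbak -c
(every OSS should now report: 256 512 1024 2048 4096)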

43 obdfilter-survey parameters (use these for CS6000 and CS9000 GridRAID SSUs):
nobjlo=1 nobjhi=1 thrlo=512 thrhi=512 size=131072
size=131072: file size, which is 2x the controller memory
nobjhi=1 nobjlo=1: number of files
thrhi=512 thrlo=512: number of worker threads when testing an OSS and SSU
The results for each CS9000 OSS should be in the range of 4300 MB/s on writes and reads.
The results for each CS6000 OSS should be in the range of 3100 MB/s on writes and 3700 MB/s on reads.
If you see significantly lower results, rerun the test multiple times to ensure the anomalies are not consistent.
NOTE: obdfilter-survey is intrusive, must be run as root, and can occasionally induce an LBUG on the OSS node; don't be alarmed.
obdfilter-survey Setup for ClusterStor GridRAID SSU

44 [root@tsesys2n00 ~]# pdsh -g oss 'nobjlo=1 nobjhi=1 thrlo=512 thrhi=512 size=131072 obdfilter-survey'
tsesys2n05: Mon May 5 10:43:04 PDT 2014 Obdfilter-survey for case=disk from tsesys2n05
tsesys2n07: Mon May 5 10:43:04 PDT 2014 Obdfilter-survey for case=disk from tsesys2n07
tsesys2n06: Mon May 5 10:43:04 PDT 2014 Obdfilter-survey for case=disk from tsesys2n06
tsesys2n04: Mon May 5 10:43:04 PDT 2014 Obdfilter-survey for case=disk from tsesys2n04
tsesys2n06: ost 1 sz 134217728K rsz 1024K obj 1 thr 512 write 3339.93 [3300.63,3505.66] rewrite 3315.21 [3220.80,3464.34] read 3815.69 [3691.21,4115.70]
tsesys2n05: ost 1 sz 134217728K rsz 1024K obj 1 thr 512 write 3313.65 [3245.19,3500.17] rewrite 3296.45 [2877.87,3495.95] read 3783.49 [3508.85,4071.69]
tsesys2n07: ost 1 sz 134217728K rsz 1024K obj 1 thr 512 write 3294.08 [3246.90,3459.60] rewrite 3280.59 [3231.68,3434.76] read 3802.78 [3519.90,3990.34]
tsesys2n04: ost 1 sz 134217728K rsz 1024K obj 1 thr 512 write 3271.72 [3195.85,3422.54] rewrite 3269.96 [3256.71,3464.56] read 3772.00 [3600.70,3972.05]
tsesys2n06: done!
tsesys2n05: done!
tsesys2n07: done!
tsesys2n04: done!
obdfilter-survey Results for GridRAID with 512 worker threads

45 Client Side Benchmark for MDRAID Typical IOR Client Configuration

46 At customer sites, typically all clients have the same architecture, the same number of CPU cores, and the same amount of memory. NOTE: Our configuration for this training is a bit unique and required additional thought to get performance per SSU. With a uniform client architecture, the IOR parameters are simpler to tune and optimize for benchmarking. A minimum of 8 clients per SSU. MDRAID Typical Client Configuration

47 Always transfer 2x the combined memory of all clients used, to avoid any client-side caching effect. In our example: (8_clients*32GB_memory)*2 = 512GB, so the total file size for the IOR benchmark will be 512GB. NOTE: Typically all nodes are uniform. IOR Rule of Thumb

48 mpirun is used to execute IOR across all clients from the head node. Within mpirun we use the following options:
-machinefile machinefile.txt: a simple text file listing all clients that will execute the IOR benchmark
-np: total number of tasks to execute across all clients (e.g. -np 64 means 8 tasks per client with 8 clients)
--byslot: fills the slots on the first node before starting additional tasks on the second node, and so on; this is tied to how the machinefile slots are defined
--bynode is another option, which places 1 task per node round-robin before placing additional tasks per node.
Using MPI to execute IOR

49 Create a simple file called 'machinefile.txt' listing all the clients, with the options slots=4 and max_slots='max number of CPU cores' (16 cores in this example).
Because we edited the /etc/hosts file with the client IP addresses, we only need to use the associated hostname for each client listed in /etc/hosts. The same is true if DNS is used at the customer site (no need to define node names in /etc/hosts), or the IPv4 address can be used directly in the machinefile.
Creating the machinefile on the head node

50 [root@fvt-client1 ~]# vi machinefile.txt
client1 slots=4 max_slots=16
client2 slots=4 max_slots=16
client3 slots=4 max_slots=16
client4 slots=4 max_slots=16
…………
client13 slots=4 max_slots=16
client14 slots=4 max_slots=16
client15 slots=4 max_slots=16
client16 slots=4 max_slots=16
:wq!
With the Xyratex-defined machinefile and the --byslot option, mpirun will use 4 slots on the first node, then 4 slots on the second node, and so on. If --bynode is used instead, tasks are placed round-robin across the nodes regardless of the machinefile slot configuration. Slots = tasks per node.
Sample machinefile for ClusterStor
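
Rather than typing the file by hand, it can be generated in one line; a minimal hedged sketch, assuming 16 clients named client1 through client16 with 16 cores each:
[root@fvt-client1 ~]# for i in $(seq 1 16); do echo "client$i slots=4 max_slots=16"; done > machinefile.txt
[root@fvt-client1 ~]# wc -l machinefile.txt
16 machinefile.txt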

51 A typical IOR invocation for 16 nodes with 32GB of memory each is:
/usr/lib64/openmpi/bin/mpirun -machinefile machinefile.txt -np 128 --byslot ./IOR -v -F -t 1m -b 8g -o /mnt/lustre/test.`date +"%Y%m%d.%H%M%S"`
-np 128 = all 16 nodes used, 8 tasks (slots) per node
-b 8g = (2x32GB*16_Clients)/128_tasks
Typically all nodes will be uniform, so we have to use the lowest common denominator.
-o /mnt/lustre/test.`date +"%Y%m%d.%H%M%S"`: using a different output test file for each run gives better performance than reusing the same filename
-F: file per process
-t 1m: transfer size of 1m
-v: verbose output
Defining IOR Parameters
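
The -b value drops straight out of the 2x-memory rule; a small hedged helper, assuming per-client memory in GB, client count, and the -np task count are known:
# MEM_GB=32; CLIENTS=16; NP=128
# echo "block size = $(( (2 * MEM_GB * CLIENTS) / NP ))g"
block size = 8g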

52 Jumbo frames give a >= 30% improvement in Lustre performance compared to the standard MTU of 1500.
Change the MTU on the clients and servers to 9000.
Change the MTU on the switches to 9214 (or the maximum MTU size) to accommodate payload overhead.
Never set the switch MTU to the same value as the clients and servers.
Ethernet Tuning

53 Writes: buffered I/O
-F: file per process
-t 1m
Default is buffered I/O
--byslot option
-np and -b will be a product of each other to achieve 6GB/s per SSU or better
Reads: direct I/O
-F: file per process
-t 64m
-B: direct I/O instead of the default buffered I/O
-np and -b will be a product of each other to achieve 6GB/s per SSU or better
--byslot option
Baseline IOR Performance for CS6000 MDRAID

54 Use the -w flag in IOR for write-only results with buffered I/O:
-F: file per process
-t 1m
Write-only operation: -w
Default is buffered I/O
--byslot option
-np and -b will be a product of each other to achieve 6GB/s per SSU or better
First, ensure a Lustre stripe size of 1m and a stripe count of 1:
lfs setstripe -s 1m -c 1 /mnt/lustre/fpp
Execute IOR for writes:
mpirun -np 64 --byslot /bin/IOR -F -t 1mb -b 16g -w -o /mnt/lustre/fpp/testw.out
IOR Write for MDRAID

55 We first need to write the data, then read it back using direct I/O:
-F: file per process
Use the write and read flags: -w -r
-t 64m
-B: direct I/O instead of the default buffered I/O
-np and -b will be a product of each other to achieve 6GB/s per SSU or better
First, ensure a Lustre stripe size of 1m and a stripe count of 1:
lfs setstripe -s 1m -c 1 /mnt/lustre/fpp
Execute IOR for reads:
mpirun -np 64 --byslot /bin/IOR -F -B -t 64mb -b 16g -w -r -o /mnt/lustre/fpp/testr.out
IOR Reads for MDRAID

56 Stonewall can be used to run a write or read test for a fixed time in seconds, to ensure only maximum performance is measured.
It removes the unbalanced task completion that can affect performance results, and is good to use when the clients have non-uniform architectures.
If used, be sure to specify a much bigger block size in IOR (-b) and a long enough time to write or read 2x client memory. Typically, I increase -b in IOR by a factor of 4x or more.
Advanced IOR Option: Stonewall

57 Use the -w flag in IOR for write-only results with buffered I/O:
-F: file per process
-t 1m
Write-only operation: -w
Default is buffered I/O
-D 240 (4 min)
-np and -b will be a product of each other to achieve 6GB/s per SSU or better
First, ensure a Lustre stripe size of 1m and a stripe count of 1:
lfs setstripe -s 1m -c 1 /mnt/lustre/fpp
Execute IOR for writes:
mpirun -np 64 --byslot /bin/IOR -F -t 1mb -b 512g -w -o /mnt/lustre/fpp/testw.out -D 240
IOR Write for MDRAID w/ Stonewall

58 We first need to write the data, then read it back using direct I/O:
-F: file per process
Use the write and read flags: -w -r
-t 64m
With stonewall, we need to write without stonewall first, then read back in a separate IOR command using stonewall.
-k: keep the write output test files so they can be read back under stonewall
-B: direct I/O instead of the default buffered I/O
-np and -b will be a product of each other to achieve 6GB/s per SSU or better
IOR Reads for MDRAID w/ Stonewall

59 Step 1: Set the Lustre stripe size and stripe count:
lfs setstripe -s 1m -c 1 /mnt/lustre/fpp
Step 2: IOR write with direct I/O, with a block size large enough to read back at least 2x client memory, keeping the output file (we want the write to complete):
mpirun -np 64 --byslot /bin/IOR -F -B -t 64mb -b 64g -w -k -o /mnt/lustre/fpp/testr.out
Step 3: IOR read back the output test file from Step 2 with the stonewall option:
mpirun -np 64 --byslot /bin/IOR -F -B -t 64mb -b 64g -r -o /mnt/lustre/fpp/testr.out -D 240
IOR Reads for MDRAID w/ Stonewall

60 Client Side Benchmark for GridRAID Typical IOR Client Configuration

61 At customer sites, typically all clients have the same architecture, the same number of CPU cores, and the same amount of memory. NOTE: Our configuration for this training is a bit unique and required additional thought to get performance per SSU. With a uniform client architecture, the IOR parameters are simpler to tune and optimize for benchmarking. A minimum of 16 clients per SSU. GridRAID Typical Client Configuration

62 Always transfer 2x the combined memory of all clients used, to avoid any client-side caching effect. In our example: (16_clients*32GB)*2 = 1024GB, so the total file size for the IOR benchmark will be 1024GB. NOTE: Typically all nodes are uniform. IOR Rule of Thumb

63 mpirun is used to execute IOR across all clients from the head node. Within mpirun we use the following options:
-machinefile machinefile.txt: a simple text file listing all clients that will execute the IOR benchmark
-np: total number of tasks to execute across all clients (e.g. -np 64 means 8 tasks per client with 8 clients)
--byslot: fills the slots on the first node before starting additional tasks on the second node, and so on; this is tied to how the machinefile slots are defined
--bynode is another option, which places 1 task per node round-robin before placing additional tasks per node.
Using MPI to execute IOR

64 Create a simple file called 'machinefile.txt' listing all the clients, with the options slots=4 and max_slots='max number of CPU cores' (16 cores in this example).
Because we edited the /etc/hosts file with the client IP addresses, we only need to use the associated hostname for each client listed in /etc/hosts. The same is true if DNS is used at the customer site (no need to define node names in /etc/hosts), or the IPv4 address can be used directly in the machinefile.
Creating the machinefile on the head node

65 [root@fvt-client1 ~]# vi machinefile.txt
client1 slots=4 max_slots=16
client2 slots=4 max_slots=16
client3 slots=4 max_slots=16
client4 slots=4 max_slots=16
…………
client13 slots=4 max_slots=16
client14 slots=4 max_slots=16
client15 slots=4 max_slots=16
client16 slots=4 max_slots=16
:wq!
With the Xyratex-defined machinefile and the --byslot option, mpirun will use 4 slots on the first node, then 4 slots on the second node, and so on. If --bynode is used instead, tasks are placed round-robin across the nodes regardless of the machinefile slot configuration. Slots = tasks per node.
Sample machinefile for ClusterStor

66 A typical IOR invocation for 16 nodes with 32GB of memory each is:
/usr/lib64/openmpi/bin/mpirun -machinefile machinefile.txt -np 128 --byslot ./IOR -v -F -t 1m -b 8g -o /mnt/lustre/test.`date +"%Y%m%d.%H%M%S"`
-np 128 = all 16 nodes used, 8 tasks (slots) per node
-b 8g = (2x32GB*16_Clients)/128_tasks
Typically all nodes will be uniform, so we have to use the lowest common denominator.
-o /mnt/lustre/test.`date +"%Y%m%d.%H%M%S"`: using a different output test file for each run gives better performance than reusing the same filename
-F: file per process
-t 1m: transfer size of 1m
-v: verbose output
Defining IOR Parameters

67 Jumbo frames give a >= 30% improvement in Lustre performance compared to the standard MTU of 1500.
Change the MTU on the clients and servers to 9000.
Change the MTU on the switches to 9214 (or the maximum MTU size) to accommodate payload overhead.
Never set the switch MTU to the same value as the clients and servers.
Ethernet Tuning

68 Writes and reads in a single operation: direct I/O
-F: file per process
-t 64m
-B: direct I/O instead of the default buffered I/O
--byslot option
-np and -b will be a product of each other to achieve 6GB/s per SSU or better
Baseline IOR Performance for ClusterStor GridRAID

69 Use the -w -r flags in IOR for write and read results with direct I/O:
-F: file per process
-t 64m
Write and read operation: -w -r
Direct I/O: -B
--byslot option
-np and -b will be a product of each other to achieve 6GB/s per SSU or better
First, ensure a Lustre stripe size of 1m and a stripe count of 1:
lfs setstripe -s 1m -c 1 /mnt/lustre/fpp
Execute IOR for writes and reads:
mpirun -np 256 --byslot /bin/IOR -F -B -t 64mb -b 16g -w -r -o /mnt/lustre/fpp/testw.out
IOR Write/Read for GridRAID

70 Stonewall can be used to run a write or read test for a fixed time in seconds, to ensure only maximum performance is measured.
It removes the unbalanced task completion that can affect performance results, and is good to use when the clients have non-uniform architectures.
If used, be sure to specify a much bigger block size in IOR (-b) and a long enough time to write or read 2x client memory. Typically, I increase -b in IOR by a factor of 4x or more.
Advanced IOR Option: Stonewall

71 Use the -w flag in IOR for write-only results with direct I/O:
-F: file per process
-t 1m
Write-only operation: -w
Direct I/O: -B
-D 240 (4 min)
Keep the output files: -k
-np and -b will be a product of each other to achieve 6GB/s per SSU or better
First, ensure a Lustre stripe size of 1m and a stripe count of 1:
lfs setstripe -s 1m -c 1 /mnt/lustre/fpp
Execute IOR for writes:
mpirun -np 64 --byslot /bin/IOR -F -B -t 64mb -b 512g -w -k -o /mnt/lustre/fpp/testw.out -D 240
IOR Write for GridRAID w/ Stonewall

72 Use the -r flag in IOR for read-only results with direct I/O:
-F: file per process
-t 1m
Read-only operation: -r
Direct I/O: -B
-D 240 (4 min)
-np and -b will be a product of each other to achieve 6GB/s per SSU or better
First, ensure a Lustre stripe size of 1m and a stripe count of 1:
lfs setstripe -s 1m -c 1 /mnt/lustre/fpp
Execute IOR for reads:
mpirun -np 64 --byslot /bin/IOR -F -B -t 64mb -b 512g -r -o /mnt/lustre/fpp/testw.out -D 120
IOR Read for GridRAID w/ Stonewall

73 Training Systems

74 10.0.159.100 hvt-super00 super00 hvt-client001 (SL6.4, 2.6.32-358.el6.x86_64, 128GB, 8 Cores)
10.0.159.101 hvt-super01 super01 hvt-client002 (SL6.4, 2.6.32-358.el6.x86_64, 128GB, 8 Cores)
10.0.159.102 hvt-super02 super02 hvt-client00 (SL6.4, 2.6.32-358.el6.x86_64, 128GB, 8 Cores)
10.0.159.103 hvt-super03 super03 hvt-client004 (SL6.4, 2.6.32-358.el6.x86_64, 128GB, 8 Cores)
Group 1: Ron C. / Tony
10.0.159.104 hvt-asus100 asus100 hvt-client005 (SL6.4, 2.6.32-358.el6.x86_64, 256GB, 24 Cores)
10.0.159.105 hvt-asus101 asus101 hvt-client006 (SL6.4, 2.6.32-358.el6.x86_64, 256GB, 24 Cores)
10.0.159.106 hvt-asus102 asus102 hvt-client007 (SL6.4, 2.6.32-358.el6.x86_64, 256GB, 24 Cores)
Group 2: Rex T / Bill L
10.0.159.107 hvt-asus103 asus103 hvt-client008 (SL6.4, 2.6.32-358.el6.x86_64, 256GB, 24 Cores)
10.0.159.108 hvt-asus200 asus200 hvt-client009 (SL6.4, 2.6.32-358.el6.x86_64, 256GB, 24 Cores)
10.0.159.109 hvt-asus201 asus201 hvt-client010 (SL6.4, 2.6.32-358.el6.x86_64, 256GB, 24 Cores)
Group 3: Randy / Mike S
10.0.159.112 hvt-asus300 asus300 hvt-client013 (SL6.4, 2.6.32-358.el6.x86_64, 256GB, 24 Cores)
10.0.159.113 hvt-asus301 asus301 hvt-client014 (SL6.4, 2.6.32-358.el6.x86_64, 256GB, 24 Cores)
10.0.159.114 hvt-asus302 asus302 hvt-client016 (SL6.4, 2.6.32-358.el6.x86_64, 256GB, 24 Cores)
10.0.159.115 hvt-asus303 asus303 hvt-client016 (SL6.4, 2.6.32-358.el6.x86_64, 256GB, 24 Cores)
Group 4: Ron M / Dan N
Compute Clients

75 # ssh admin@10.0.159.56
[root@tsesys2n00 ~]# cscli fs_info
Information about "tsefs2" file system:
Node        Node type  Targets  Failover partner  Devices
tsesys2n02  mgs        0 / 0    tsesys2n03
tsesys2n03  mds        1 / 1    tsesys2n02        /dev/md66
tsesys2n04  oss        1 / 1    tsesys2n05        /dev/md0
tsesys2n05  oss        1 / 1    tsesys2n04        /dev/md1
tsesys2n06  oss        1 / 1    tsesys2n07        /dev/md0
tsesys2n07  oss        1 / 1    tsesys2n06        /dev/md1
[root@tsesys2n00 ~]# ssh tsesys2n02 'lctl list_nids'
172.21.2.3@o2ib
Mount command from clients:
mount -t lustre 172.21.2.3@o2ib:172.21.2.4@o2ib:/tsefs2 /mnt/tsefs2
CS6000 GridRAID System

76 # ssh admin@10.0.159.12
[root@hvt1sys00 ~]# cscli fs_info
Information about "fs1" file system:
Node       Node type  Targets  Failover partner  Devices
hvt1sys02  mgs        0 / 0    hvt1sys03
hvt1sys03  mds        1 / 1    hvt1sys02         /dev/md66
hvt1sys04  oss        4 / 4    hvt1sys05         /dev/md0, /dev/md2, /dev/md4, /dev/md6
hvt1sys05  oss        4 / 4    hvt1sys04         /dev/md1, /dev/md3, /dev/md5, /dev/md7
[root@hvt1sys00 ~]# ssh hvt1sys02 'lctl list_nids'
172.18.1.4@o2ib
Mount command from clients:
mount -t lustre 172.18.1.4@o2ib:172.18.1.5@o2ib:/fs1 /mnt/fs1
CS6000 MDRAID System

77 Always transfer 2x the combined memory of all clients used, to avoid any client-side caching effect.
In our example:
Group 1: (4_nodes*128GB)*2 = 1024GB
Group 2: (3_nodes*256GB)*2 = 1536GB
Group 3: (3_nodes*256GB)*2 = 1536GB
Group 4: (4_nodes*256GB)*2 = 2048GB
This total will be used to determine the IOR -b flag, based on the number of tasks given to the mpirun -np flag.
IOR Transfer Size

78 Group 1
[root@fvt-super00 ~]# vi machinefile.txt
super00 slots=4 max_slots=8
super01 slots=4 max_slots=8
super02 slots=4 max_slots=8
super03 slots=4 max_slots=8
Group 2
[root@fvt-asus100 ~]# vi machinefile.txt
asus100 slots=4 max_slots=24
asus101 slots=4 max_slots=24
asus102 slots=4 max_slots=24
Group 3
[root@fvt-asus103 ~]# vi machinefile.txt
asus103 slots=4 max_slots=24
asus201 slots=4 max_slots=24
asus202 slots=4 max_slots=24
Group 4
[root@fvt-asus300 ~]# vi machinefile.txt
asus300 slots=4 max_slots=24
asus301 slots=4 max_slots=24
asus302 slots=4 max_slots=24
Lab machinefile for Each Group

79 Before we run IOR, we want to confirm the configuration we just created. For baselining the performance of each SSU using IOR, we need to make sure the stripe count for each directory is set to 1 and the stripe size is set to 1M. The command and output to do this are on the next slide.
For example, create a directory called benchmark under the Lustre mount point on a client:
# mkdir /mnt/lustre/benchmark
Set the Lustre stripe count to 1 and stripe size to 1m on a client:
# lfs setstripe -c 1 -s 1m /mnt/lustre/benchmark
Create and Confirm Lustre Stripe/Count from a Client

80 Group 1
mpirun flags:
-np 32 = 4 clients will execute 8 tasks (slots) each
--byslot distribution
IOR flags:
-b 32g = we are transferring 1024GB total, 1024GB/32 = 32g
-o /mnt/lustre/test.`date +"%Y%m%d.%H%M%S"` (using a different output test file for each run gives better performance than reusing the same filename), or explicitly use -o /mnt/lustre/test.0
-F: file per process
-t 1m: transfer size of 1m
-w -r: write and read flags in IOR
-k: keep output files
Defining IOR Parameters for Group 1

81 Group 2
mpirun flags:
-np 24 = 3 clients will execute 8 tasks (slots) each
--byslot distribution
IOR flags:
-b 64g = we are transferring 1536GB total, 1536GB/24 = 64g
-o /mnt/lustre/test.`date +"%Y%m%d.%H%M%S"` (using a different output test file for each run gives better performance than reusing the same filename), or explicitly use -o /mnt/lustre/test.0
-F: file per process
-t 1m: transfer size of 1m
-w -r: write and read flags in IOR
-k: keep output files
Defining IOR Parameters for Group 2

82 Group 3
mpirun flags:
-np 24 = 3 clients will execute 8 tasks (slots) each
--byslot distribution
IOR flags:
-b 64g = we are transferring 1536GB total, 1536GB/24 = 64g
-o /mnt/lustre/test.`date +"%Y%m%d.%H%M%S"` (using a different output test file for each run gives better performance than reusing the same filename), or explicitly use -o /mnt/lustre/test.0
-F: file per process
-t 1m: transfer size of 1m
-w -r: write and read flags in IOR
-k: keep output files
Defining IOR Parameters for Group 3

83 Group 4
mpirun flags:
-np 32 = 4 clients will execute 8 tasks (slots) each
--byslot distribution
IOR flags:
-b 64g = we are transferring 2048GB total, 2048GB/32 = 64g
-o /mnt/lustre/test.`date +"%Y%m%d.%H%M%S"` (using a different output test file for each run gives better performance than reusing the same filename), or explicitly use -o /mnt/lustre/test.0
-F: file per process
-t 1m: transfer size of 1m
-w -r: write and read flags in IOR
-k: keep output files
Defining IOR Parameters for Group 4

84 The first parameters changed:
Disable wire checksums
Disable LRU
Increase max_rpcs_in_flight to 32 or 256 (1.8.9 or 2.4.x/2.5.1)
Increase max_dirty_mb to 128 or 256 (1.8.9 or 2.4.x/2.5.1)
Procedure to optimize Client Side Tuning

85 Disable client checksums, with the specific FS name of cstorfs:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/checksums; do echo 0 > $n; done'
Increase max RPCs in flight from the default 8 to 32:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_rpcs_in_flight; do echo 32 > $n; done'
Disable LRU:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))'
Increase max dirty MB from 32 to 128:
pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_dirty_mb; do echo 128 > $n; done'
NOTE: These settings are not persistent and will need to be reset if Lustre is re-mounted or the client is rebooted.
Client Lustre Tuning Parameters for 1.8.9

86 Disable client checksums, with the specific FS name of cstorfs:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/checksums; do echo 0 > $n; done'
Increase max RPCs in flight from the default 8 to 256:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_rpcs_in_flight; do echo 256 > $n; done'
Disable LRU:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))'
Increase max dirty MB from 32 to 256:
pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_dirty_mb; do echo 256 > $n; done'
NOTE: These settings are not persistent and will need to be reset if Lustre is re-mounted or the client is rebooted.
Client Lustre Tuning Parameters for 2.4.x/2.5.1

87 Checking the checksum algorithm on the clients:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'cat /proc/fs/lustre/osc/cstorfs-OST00*/checksum_type'
The default is adler, but it can be changed to crc32c:
[root@fvt-client1 ~]# pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/checksum_type; do echo crc32c > $n; done'
Checksum Algorithm on Client

88 [root@fvt-client1 ~]# cat clientprep.sh
#! /bin/bash
echo "Mount Lustre"
pdsh -w client[1-12] 'mount -t lustre 172.18.1.3@o2ib0:172.18.1.4@o2ib0:/cstorfs /mnt/lustre'
sleep 3
echo "Disable Client Checksums"
pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/checksums; do echo 0 > $n; done'
sleep 3
echo "Increase Max RPCs in Flight from 8 to 32"
pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_rpcs_in_flight; do echo 32 > $n; done'
sleep 3
echo "Increase Max Dirty MB from 32 to 128"
pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_dirty_mb; do echo 128 > $n; done'
sleep 3
echo "Disable LRU"
pdsh -w client[1-12] 'lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))'
sleep 3
echo "Done Prepping Lustre Clients[1-12] for Benchmarking"
Script to Mount and Tune Clients for 1.8.9

89 [root@fvt-client1 ~]# cat clientprep.sh
#! /bin/bash
echo "Mount Lustre"
pdsh -w client[1-12] 'mount -t lustre 172.18.1.3@o2ib0:172.18.1.4@o2ib0:/cstorfs /mnt/lustre'
sleep 3
echo "Disable Client Checksums"
pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/checksums; do echo 0 > $n; done'
sleep 3
echo "Increase Max RPCs in Flight from 8 to 256"
pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_rpcs_in_flight; do echo 256 > $n; done'
sleep 3
echo "Increase Max Dirty MB from 32 to 256"
pdsh -w client[1-12] 'for n in /proc/fs/lustre/osc/cstorfs-OST00*/max_dirty_mb; do echo 256 > $n; done'
sleep 3
echo "Disable LRU"
pdsh -w client[1-12] 'lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))'
sleep 3
echo "Done Prepping Lustre Clients[1-12] for Benchmarking"
Script to Mount and Tune Clients for 2.4.x/2.5.x

90 References

91 Benchmarking Best Practices (http://goo.gl/3wSY8M)
Rice Oil and Gas Talk: Tuning and Measuring Performance (http://goo.gl/CKoodO)
LUG 2014 Client Performance Results (http://goo.gl/e7XVLG)
IEEE Paper, Torben Kling-Petersen and John Fragalla, on Optimizing Performance for an HPC Storage System (http://goo.gl/e7XVLG)
Neo Performance Folder (http://goo.gl/CKoodO)
References

92 Training Module: Client Connectivity/Configuration Benchmarking John_Fragalla@xyratex.com Bill_Loewe@xyratex.com Thank You

93 Server Lustre Configuration Server Side Storage Pool Configuration

94 On the client, use lsof to find the processes still attached to the Lustre mount point:
# lsof | grep /mnt/lustre
Kill the process, using -9 if needed:
# kill -9 <PID>
Can't umount Lustre on a Client
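
A hedged one-liner that combines the two steps, assuming the lsof -t (PIDs only) and GNU xargs -r options available on RHEL/SL6; fuser -km /mnt/lustre is an equivalent alternative:
# lsof -t /mnt/lustre | xargs -r kill -9
# umount /mnt/lustre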

95 Listing all the OSTs (needed to define the storage pools) can be done by running a command on the MDS node (node 3 of the server system):
[admin@lmtest403 ~]$ lctl dl
0 UP mgc MGC172.18.1.3@o2ib c857b2ef-f624-e14a-9fb1-8ad7525e4fe4 5
1 UP lov cstorfs-MDT0000-mdtlov cstorfs-MDT0000-mdtlov_UUID 4
2 UP mdt cstorfs-MDT0000 cstorfs-MDT0000_UUID 3
3 UP mds mdd_obd-cstorfs-MDT0000 mdd_obd_uuid-cstorfs-MDT0000 3
4 UP osc cstorfs-OST0015-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
5 UP osc cstorfs-OST000d-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
6 UP osc cstorfs-OST0005-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
7 UP osc cstorfs-OST0012-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
8 UP osc cstorfs-OST000a-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
9 UP osc cstorfs-OST0002-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
10 UP osc cstorfs-OST0016-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
11 UP osc cstorfs-OST000e-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
12 UP osc cstorfs-OST0006-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
13 UP osc cstorfs-OST0013-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
14 UP osc cstorfs-OST000b-osc-MDT0000 cstorfs-MDT0000-mdtlov_UUID 5
Storage Pools: List all OSTs

96 From the output on the MDS node, we see the following OST assignment per SSU:
SSU1: OST[0000-0007]
SSU2: OST[0008-000f]
SSU3: OST[0010-0017]
Storage Pools: OSTs per SSU

97 Storage pools are created on the node where the MGS target is mounted, which is primarily node 02. We will create 3 storage pools, 1 per SSU, with the following commands as root:
[root@lmtest402 ~]# lctl pool_new cstorfs.ssu1
[root@lmtest402 ~]# lctl pool_new cstorfs.ssu2
[root@lmtest402 ~]# lctl pool_new cstorfs.ssu3
[root@lmtest402 ~]# lctl pool_add cstorfs.ssu1 OST[0000-0007]
[root@lmtest402 ~]# lctl pool_add cstorfs.ssu2 OST[0008-000f]
[root@lmtest402 ~]# lctl pool_add cstorfs.ssu3 OST[0010-0017]
We will see output similar to this:
Pool cstorfs.ssu3 not found
Can't verify pool command since there is no local MDT or client, proceeding anyhow...
Pool cstorfs.ssu3 not found
Can't verify pool command since there is no local MDT or client, proceeding anyhow...
This is OK because the MGS is not mounted with the MDT. If the MGS and MDT were mounted on the same node, we would not see the above warnings.
Storage Pools: Creation

98 First remove the OSTs from the storage pools that were created on the MGS node:
[root@lmtest402 ~]# lctl pool_list cstorfs.ssu1
[root@lmtest402 ~]# lctl pool_list cstorfs.ssu2
[root@lmtest402 ~]# lctl pool_list cstorfs.ssu3
[root@lmtest402 ~]# lctl pool_remove cstorfs.ssu1 OST[0000-0007]
[root@lmtest402 ~]# lctl pool_remove cstorfs.ssu2 OST[0008-000f]
[root@lmtest402 ~]# lctl pool_remove cstorfs.ssu3 OST[0010-0017]
Destroy (delete) the storage pools:
[root@lmtest402 ~]# lctl pool_destroy cstorfs.ssu1
[root@lmtest402 ~]# lctl pool_destroy cstorfs.ssu2
[root@lmtest402 ~]# lctl pool_destroy cstorfs.ssu3
Storage Pools: Deletion

99 Server Lustre Configuration Client Side Storage Pool Configuration

100 Pick any client with Lustre mounted; first change directory into the Lustre mount point:
[root@fvt-client1 lustre]# cd /mnt/lustre
First, list the storage pools we just created on the MGS node from this client:
[root@fvt-client1 lustre]# lctl pool_list cstorfs
Pools from cstorfs:
cstorfs.ssu3
cstorfs.ssu2
cstorfs.ssu1
Now check the OST assignment of one of the storage pools:
[root@fvt-client1 lustre]# lctl pool_list cstorfs.ssu1
Pool: cstorfs.ssu1
cstorfs-OST0000_UUID
cstorfs-OST0001_UUID
cstorfs-OST0002_UUID
cstorfs-OST0003_UUID
cstorfs-OST0004_UUID
cstorfs-OST0005_UUID
cstorfs-OST0006_UUID
cstorfs-OST0007_UUID
Configuring Lustre to use Storage Pools

101 Still in the /mnt/lustre directory on the same Lustre client, create 3 subdirectories that we will assign to each storage pool:
[root@fvt-client1 lustre]# mkdir ssu1
[root@fvt-client1 lustre]# mkdir ssu2
[root@fvt-client1 lustre]# mkdir ssu3
Assign the storage pools we created to each directory, with a stripe count of 1 and stripe size of 1m:
[root@fvt-client1 lustre]# lfs setstripe -p cstorfs.ssu1 -c 1 -s 1m /mnt/lustre/ssu1
[root@fvt-client1 lustre]# lfs setstripe -p cstorfs.ssu2 -c 1 -s 1m /mnt/lustre/ssu2
[root@fvt-client1 lustre]# lfs setstripe -p cstorfs.ssu3 -c 1 -s 1m /mnt/lustre/ssu3
Assigning a Storage Pool to a Sub Directory
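
With the pool directories in place, each SSU can be baselined on its own by pointing the IOR output path at the matching directory; a hedged sketch reusing the Group 1 parameters from earlier (the -np and -b values are those assumptions, not fixed requirements):
mpirun -machinefile machinefile.txt -np 32 --byslot ./IOR -v -F -t 1m -b 32g -w -r -k -o /mnt/lustre/ssu1/test.0
(repeat with -o /mnt/lustre/ssu2/test.0 and -o /mnt/lustre/ssu3/test.0 to baseline the other SSUs)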

102 Before we run IOR, we want to confirm the configuration we just created. For baselining the performance of each SSU using IOR, we need to make sure the stripe count for each directory is set to 1 and the stripe size is set to 1M. The command and output to do this are on the next slide.
Confirm Lustre Stripe/Count and Pools

103 [root@fvt-client1 lustre]# lfs getstripe /mnt/lustre
/mnt/lustre
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1
/mnt/lustre/ssu3
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1 pool: ssu3
/mnt/lustre/ssu1
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1 pool: ssu1
/mnt/lustre/ssu2
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1 pool: ssu2
Confirm Lustre Stripe/Count and Pools

104 Server Lustre Configuration Creating Storage Pools for Testing

105 We need to create Lustre storage pools because 8 clients cannot stress our 3-SSU test system. Storage pools allow us to test each SSU individually by directing the IOR output to a specific pool, so we can baseline each SSU on its own. Configuring storage pools is a combination of MGS-side and client-side configuration. NOTE: This is typically not needed for baseline performance at customer sites.
Lustre Storage Pools

