1 Willkommen Welcome Bienvenue How we work with users in a small environment HPC@Empa Patrik Burkhalter

2 How we work with users in a small environment, HPC@Empa. Patrik Burkhalter, system administrator of the HPC cluster at Empa; at Empa since 2012; Linux system admin before Empa (mainly web, DB and app servers).

3 Agenda: Situation at Empa, Cluster support, User support, Enforcement

4 Situation at Empa: At the moment, we have two clusters at Empa: Ipazia, the cluster we have run since 2006, and Hypatia, the new cluster we built this year. The computing nodes of the old cluster will be detached from Ipazia and connected to Hypatia step by step.

5 Situation at Empa: Ipazia, the Empa HPC cluster: 102 nodes (Dell), built with the help of ParTec and CSCS, ParaStation cluster middleware from ParTec, Torque resource manager, Maui scheduler, InfiniBand DDR interconnect, Lustre file systems.

6 Situation at Empa: Ipazia hardware
Front end node: PowerEdge 2900, 2 * Intel Xeon 5140 @ 2.33 GHz (4 cores), 4 GB RAM, 1 TB shared /home
Computing nodes:
- Nodes 1...30: deactivated, old 4-core pizza boxes
- Nodes 31...46: PowerEdge M605, 2 * quad-core AMD Opteron 2356, 32 GB RAM
- Nodes 47...102: PowerEdge M610, 2 * Intel Xeon E5540 @ 2.53 GHz, 24 GB RAM

7 Situation at Empa: Hypatia, the new Empa cluster, built from scratch by Empa: 32 nodes in 2 Dell M1000e chassis, Torque resource manager, Maui scheduler, InfiniBand FDR interconnect, Lustre file systems. The know-how is completely at Empa (we only have vendor support for the SAN units). Well documented and in production; nodes from Ipazia will be migrated to Hypatia soon.

8 Situation at Empa: Hypatia hardware
Front end node: PowerEdge R620, 2 * Intel Xeon E5-2660 @ 2.20 GHz (16 * 2 logical cores with hyper-threading), 32 GB RAM
Computing nodes: PowerEdge M620, 2 * Intel Xeon E5-2660 @ 2.20 GHz (16 cores), 64 GB RAM

9 Situation at Empa: pbstop on Ipazia

10 Situation at Empa: pbstop on Hypatia (the new cluster)

11 Situation at Empa: Lustre storage is available to both clusters: 25 TB for backed-up data (/project) and 35 TB of speed-optimized space (/scratch); the speed comes from the large number of disks.

12 Situation at Empa: We changed our support model this year from external support to in-house support. Why did we do this? We felt confident that it was possible; we save money on the service contracts; we can now fix (almost) everything ourselves; and we can provide better user support because we have a deeper understanding of the system. How did we minimize the risk of breaking the cluster? We built a new cluster and left the running cluster alone. A lot of users are already using the new cluster, and we can migrate the nodes to it once its stability is proven.

13 Situation at Empa: On Ipazia, the pizza-box nodes were removed; 2 new chassis, 1 new front end, and 1 new SSD storage unit were added.

14 Situation at Empa: Support team: Daniele Passerone (5% FTE), Carlo Pignedoli (5% FTE), Patrik Burkhalter (50% FTE).

15 Agenda: Situation at Empa, Cluster support, User support, Enforcement

16 Cluster Support: Support we provide
- Introduction to basic Linux usage: connecting to the system with an SSH client, basic Linux commands, the file system hierarchy
- Introduction of new users to the cluster
- Planning of future jobs and reservation of nodes for users (see the sketch below)
- Installation, compilation and testing of new software: GNU and Intel compilers, MPI (Open MPI/MVAPICH2), OpenFOAM, Abaqus, and any software requested by a user
- System updates: hardware and OS, software updates
- Acquiring and installing new hardware: new nodes, a GPU node, replacing failed hardware
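
For the node reservations we rely on the Maui scheduler; a hedged sketch of what such a reservation could look like with Maui's setres command (user name, duration and node names are hypothetical):

    # reserve nodes node01..node04 for user alice for one day (all values hypothetical)
    setres -u alice -d 1:00:00:00 'node0[1-4]'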

17 Cluster Support: Documentation of the cluster architecture

18 Cluster Support: Documentation of the cluster usage
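
The usage documentation centers on job submission with Torque; a minimal job script might look like this sketch (resource values, module name and binary are assumptions):

    #!/bin/bash
    #PBS -N example_job
    #PBS -l nodes=1:ppn=16
    #PBS -l walltime=02:00:00
    # run from the directory the job was submitted from
    cd $PBS_O_WORKDIR
    # set up the MPI environment via a module (name/version hypothetical)
    module load openmpi/1.6.5
    mpirun ./my_app

The script would then be submitted with `qsub job.sh` and watched with qstat or pbstop.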

19 Cluster Support: Lustre file system maintenance and extension. At the moment, we are migrating our Lustre file systems workspc and storage to project and scratch while the file systems stay online. project is a completely new file system on new hardware, with SSDs for the metadata target (MDT).
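
As an illustration of how such a file system might be created with an SSD-backed MDT, a hedged sketch (device paths and the MGS NID are hypothetical, not our actual layout):

    # format the SSD as the metadata target (MDT) of the new file system
    mkfs.lustre --fsname=project --mgsnode=10.0.0.1@o2ib --mdt --index=0 /dev/sdb
    # format an object storage target (OST) on the new disk hardware
    mkfs.lustre --fsname=project --mgsnode=10.0.0.1@o2ib --ost --index=0 /dev/sdc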

20 Cluster Support: Lustre file system maintenance and extension. scratch is a new file system built out of the old file systems workspc and storage. We deactivate one OST per file system on the old file systems, use `lfs find` to locate the files that have stripes on the deactivated OSTs, copy those files to a new location on the same file system, and finally move them back to their original location (see the sketch below).
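
To verify where a file's stripes actually live, `lfs getstripe` lists the OST indices (path hypothetical):

    # the obdidx column shows the OSTs holding this file's stripes
    lfs getstripe /mnt/storage/pbu/some_file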

21 Cluster Support: The OST gets disabled temporarily on the I/O node. This makes sure the OST will stay readable:

    lctl dl | grep ' osc '
    lctl --device <devno> deactivate

where <devno> is the number of the osc device reported by the first command.

22 Cluster Support: Migration for files with an access time > 14 days. Copies quickly, but is kind of dirty:

    TMPDIR="/mnt/storage/tmp"
    for i in $(lfs find --obd storage-OST0003 --atime +14 /mnt/storage); do
        DIR=$(dirname "$i")
        FILE=$(basename "$i")
        TMPPATH="$TMPDIR/$FILE"
        SRCPATH="$DIR/$FILE"
        # copy the file off the deactivated OST, then move it back in place;
        # abort the whole run if any step fails
        echo -en "$SRCPATH: "
        cp -p "$SRCPATH" "$TMPPATH" || exit 1
        mv "$TMPPATH" "$SRCPATH" || exit 1
        echo done
    done

23 Cluster Support: Migration for newer files. lfs_migrate checks whether a file was changed during the migration, but it does not check whether the file is open on another node. Therefore we only touch users who have no jobs and no running processes on the front end node (see the sketch below):

    lfs find --obd storage-OST0003 /mnt/storage/pbu | lfs_migrate -y
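
A simple check along these lines can confirm that a user is idle before their files are migrated (a sketch; the user name is hypothetical):

    # any queued or running jobs for this user?
    qstat -u bob
    # any processes left on the front end node?
    pgrep -u bob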

24 Cluster Support: After the migration, the OST gets deactivated permanently:

    lctl conf_param storage-OST0003.osc.active=0

25 Cluster Support: Situation after the migration

26 Cluster Support: Problems we experienced during the migration: a lot of small files are hard to migrate, and users tend to "hoard" data.

27 Cluster Support: We also provide shell environments for the users to ease cluster usage. We are using Environment Modules (http://modules.sourceforge.net/). A module is loaded with `module load <name>/<version>`; it sets the user's environment variables as defined in the module file. We provide a module for each self-compiled app and library, which is particularly handy for users who like to compile their own software. We started using this approach this year (typical usage is sketched below).
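
Typical module usage looks like this (the module name and version are hypothetical):

    # list the modules available on the cluster
    module avail
    # load one app or library into the environment
    module load openmpi/1.6.5
    # show which modules are currently loaded
    module list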

28 Cluster Support: Modules on Ipazia

29 Cluster Support: Modules on Hypatia. New modules are installed on user request.

30 Cluster Support: Example output of a module: a simple module for ffmpeg. We are trying to get rid of LD_LIBRARY_PATH and use RPATH instead; this makes sure that a compiled binary uses the proper libraries independently of the user environment (see the sketch below). The module concept was new to our users but was well accepted.
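
To illustrate the difference, a hedged sketch with a hypothetical library under /share/libs: with LD_LIBRARY_PATH the binary finds its libraries only if the user environment is set correctly, while RPATH embeds the search path in the binary at link time:

    # relies on LD_LIBRARY_PATH being set at run time
    gcc -o myapp myapp.c -L/share/libs/foo/lib -lfoo
    # embeds the library search path in the binary itself
    gcc -o myapp myapp.c -L/share/libs/foo/lib -lfoo -Wl,-rpath,/share/libs/foo/lib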

31 Agenda: Situation at Empa, Cluster support, User support, Enforcement

32 User Support: Users from Empa and Eawag: ~120 users, 40 of them active in the last 30 days.

    last | awk '{print $1}' | sort | uniq | wc -l

33 User Support: The typical vendor-to-customer relationship does not work at Empa. We cannot provide a Service Level Agreement (SLA); we can only offer support on a best-effort basis; there is no support during the night or on weekends; and unplanned downtime can happen.

34 User Support: Typical IT user support does not work either. We cannot offer out-of-the-box solutions, we don't like to "just solve the problem now", and we often don't know the solution right away.

35 User Support: Treating the user as a partner works best for us. The user gets treated as an equal. "If you think your users are idiots, only idiots will use it." (Linus Torvalds)

36 User as a Partner: The user has strong scientific know-how and sometimes just uses the software; the engineer has strong know-how about clusters. This means: a request by a scientist has to be reduced to the point at which the engineer can understand it; the problem gets fixed by the engineer; the solution gets communicated to the scientist in detail, until the scientist understands the particular situation; and it gets tested by the user. It is important that each side understands the issue, otherwise potential optimizations of the system get lost.

37 User as a Partner: If a user is experienced, tasks get delegated to the user. This could be: compilation of apps and libraries, testing of a new package, or problem analysis. The solution always gets deployed by root to make sure all standards are fulfilled. If the software is in the repository of our Linux distribution, it gets installed using the package manager; if it is too old or not available there, it gets compiled and installed in /share/apps or /share/libs. Modules are provided to set the user environment (`module load <name>/<version>`). Our software gets compiled on a computing node and installed on the shared file system.

38 User as a Partner: Example: Abaqus, a Finite Element Method (FEM) package used by the mechanical systems engineering department of Empa. The users have a strong background in mechanical engineering and use Abaqus on Windows to engineer parts. We made a wrapper to simplify job submission (see the sketch below).
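
A minimal sketch of what such a wrapper might look like; the interface, resource values and module name are assumptions for illustration, not the actual Empa wrapper:

    #!/bin/bash
    # usage: abaqus_submit <input.inp> [ncpus]   (hypothetical interface)
    INPUT=$1
    NCPUS=${2:-8}
    JOB=$(basename "$INPUT" .inp)
    # generate a Torque job on the fly and submit it
    qsub -N "$JOB" -l nodes=1:ppn=$NCPUS <<EOF
    cd \$PBS_O_WORKDIR
    module load abaqus
    abaqus job=$JOB input=$INPUT cpus=$NCPUS interactive
    EOF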

39 Agenda: Situation at Empa, Cluster support, User support, Enforcement

40 Enforcement: At the moment, we only do enforcement of a few things. Obviously, the root password is not given to the users. Disk quotas are in place (size and inodes; see the sketch below). The Maui scheduling configuration is enforced as well; its optimization is planned for Hypatia, the new cluster.
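
The quotas are set per user on the Lustre file systems; a minimal sketch with a hypothetical user and limits (block values in KB):

    # size (block) and inode limits for one user, soft and hard
    lfs setquota -u bob -b 500000000 -B 550000000 -i 900000 -I 1000000 /mnt/project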

41 Enforcement: The login screen provides some information to make the user aware of the cluster situation (see the sketch below).
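
As a hypothetical sketch, such a login message (for example in /etc/motd) could look like:

    Welcome to Hypatia
    Support is best effort during office hours - no SLA.
    Node migration from Ipazia is in progress; short interruptions can happen.
    /project (25TB) is backed up, /scratch (35TB) is not.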

42 Thanks for listening. Any questions, thoughts?

