Presentation on theme: "Using The Cluster. What We’ll Be Doing Add users Run Linpack Compile code Compute Node Management."— Presentation transcript:

1 Using The Cluster

2 What We’ll Be Doing Add users Run Linpack Compile code Compute Node Management

3 Add a User

4 Adding a User Account

Execute: # useradd <username>

5 Output from 'useradd'

Creating user: gb
make: Entering directory `/var/411'
/opt/rocks/bin/411put --comment="#" /etc/auto.home
411 Wrote: /etc/411.d/etc.auto..home
Size: 514/207 bytes (encrypted/plain)
Alert: sent on channel 239.2.11.71 with master 10.1.1.1
/opt/rocks/bin/411put --comment="#" /etc/passwd
411 Wrote: /etc/411.d/etc.passwd
Size: 2565/1722 bytes (encrypted/plain)
Alert: sent on channel 239.2.11.71 with master 10.1.1.1
/opt/rocks/bin/411put --comment="#" /etc/shadow
411 Wrote: /etc/411.d/etc.shadow
Size: 1714/1093 bytes (encrypted/plain)
Alert: sent on channel 239.2.11.71 with master 10.1.1.1
/opt/rocks/bin/411put --comment="#" /etc/group
411 Wrote: /etc/411.d/etc.group
Size: 1163/687 bytes (encrypted/plain)
Alert: sent on channel 239.2.11.71 with master 10.1.1.1
make: Leaving directory `/var/411'

6 411 Secure Information Service

- Secure NIS replacement
- Distributes files within the cluster
  - The default 411 configuration distributes the user account files, but 411 can be used to distribute any file to all nodes
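
As a sketch of the second bullet, an arbitrary file can be published with the same 411put tool that appears in the useradd output above (the file name here is purely illustrative):

```
# Publish a file into 411 so every compute node pulls it --
# same invocation style as the /etc/passwd line in the useradd output.
/opt/rocks/bin/411put /etc/exports
```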

7 411 Secure Information Service

- When a 411-monitored file changes, an alert is multicast
  - When a node receives an alert, it pulls the file associated with that alert
- Compute nodes also periodically pull all files under the control of 411

8 User Accounts

- All user accounts are housed on the frontend under /export/home/<username>
- All nodes use 'autofs' to automatically mount a user's directory when the user logs into a node
  - This method provides a simple global file system
- On the frontend and every compute node, the user's account is available at /home/<username>
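
The autofs indirection can be pictured with a map entry like the following. This is only a sketch: the real /etc/auto.home is generated by Rocks and pushed out via 411, and the frontend hostname shown is an assumption:

```
# /etc/auto.home -- mounts /home/<username> on demand.
# '*' matches the username being looked up; '&' substitutes it into the path.
*   frontend.local:/export/home/&
```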

9 Deleting a User

- Execute: # userdel <username>
- Note: the user's home directory (/export/home/<username>) will not be removed
  - For safety, it must be removed by hand

10 Running Linpack

11 Linpack

- Linpack is a floating-point benchmark that solves a dense linear system Ax = b
- Measures sustained floating-point operations per second
  - "Gigaflops" = 1 billion floating-point operations per second
- This benchmark is used to rate the Top500 fastest supercomputers in the world
- We use it as a comprehensive test of the system
  - Stresses the CPU
  - Exercises the MPICH layer
  - Sends a modest number of messages
  - Ensures a user can launch a job on all nodes
  - Can be run through the queueing system, which tests the queueing system as well

12 Running Linpack From the Command Line

- Log in as a non-root user: # su - <username>
- Make a 'machines' file
  - Execute: vi machines
  - Input the following: compute-0-0
- Get a test Linpack configuration file:
  $ cp /var/www/html/rocks-documentation/3.2.0/examples/HPL.dat .
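
For reference, the lines of HPL.dat that matter for the exercises below look roughly like this (a sketch of HPL's input format; the Ns/Ps values are those the later slides edit, the NBs/Qs values are taken from the sample output):

```
1            # of problems sizes (N)
1000         Ns
1            # of NBs
64           NBs
1            # of process grids (P x Q)
1            Ps
2            Qs
```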

13 Run It

- Load your ssh key into your environment:
  $ ssh-agent $SHELL
  $ ssh-add
- Execute Linpack:
  $ /opt/mpich/gnu/bin/mpirun -nolocal -np 2 \
      -machinefile machines /opt/hpl/gnu/bin/xhpl
- Flags:
  - -nolocal : don't run Linpack on the host that is launching the job
  - -np 2 : give the job 2 processors
  - -machinefile : run the job on the nodes specified in the file 'machines'

14 Successful Linpack Output

The following parameter values will be used:

N      :    2000
NB     :      64
P      :       1
Q      :       2
PFACT  :    Left    Crout    Right
NBMIN  :       8
NDIV   :       2
RFACT  :   Right
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 80)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
   1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
   2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
   3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

============================================================================
T/V          N    NB    P    Q    Time    Gflops
----------------------------------------------------------------------------
W11R2L8   2000    64    1    2    1.96    2.724e+00
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) = 0.1049227 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) = 0.0255037 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0055411 ...... PASSED

15 Running Linpack Through a Job Management System

- Get a test SGE submission script:
  $ cp /var/www/html/rocks-documentation/3.2.0/examples/sge-qsub-test.sh .
- Examine the script
  - Most of the script concerns adding (and removing) a temporary ssh key in your environment

16 Important Parts Of The Script

- At the top: the requested number of processors
  #$ -pe mpi 2
- In the middle: what job to run
  /opt/mpich/gnu/bin/mpirun -nolocal -np $NSLOTS \
      -machinefile $TMPDIR/machines \
      /opt/hpl/gnu/bin/xhpl
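
Putting the two pieces together, the submission script is essentially the following. This is a sketch, not the verbatim sge-qsub-test.sh: the ssh-key bookkeeping the previous slide mentions is elided, and the -cwd option is an assumption:

```
#!/bin/bash
#$ -pe mpi 2                 # top: request 2 slots from the 'mpi' parallel environment
#$ -cwd                      # run from the submission directory

# middle: the job itself; SGE fills in $NSLOTS and writes a
# per-job machine file under $TMPDIR
/opt/mpich/gnu/bin/mpirun -nolocal -np $NSLOTS \
    -machinefile $TMPDIR/machines \
    /opt/hpl/gnu/bin/xhpl
```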

17 Submit the Job Send the job off to SGE: $ qsub sge-qsub-test.sh

18 Monitoring the Job

Command-line status: $ qstat -f

queuename      qtype  used/tot.  load_avg  arch    states
----------------------------------------------------------------------------
compute-0-0q   BIP    2/2        99.99     glinux
   3 0 sge-qsub-t  bruno  r  06/03/2004 02:48:15  MASTER
   3 0 sge-qsub-t  bruno  r  06/03/2004 02:48:15  SLAVE

19 Job Output

SGE writes 4 files:
- sge-qsub-test.sh.e0 : stderr for job '0'
- sge-qsub-test.sh.o0 : stdout for job '0'
- sge-qsub-test.sh.pe0 : stderr from the queueing system regarding job '0'
- sge-qsub-test.sh.po0 : stdout from the queueing system regarding job '0'

20 Removing a Job from the Queue

- Execute: $ qdel <job id>
- Find the job id with 'qstat -f':

  queuename      qtype  used/tot.  load_avg  arch    states
  ----------------------------------------------------------------------------
  compute-0-0q   BIP    2/2        99.99     glinux
     3 0 sge-qsub-t  bruno  r  06/03/2004 02:48:15  MASTER
     3 0 sge-qsub-t  bruno  r  06/03/2004 02:48:15  SLAVE

- To remove the job above: $ qdel 3

21 Monitoring SGE Via The Web

- Set up access to the web server
  - Local access: configure X with redhat-config-xfree86
  - Remote access: open the http port in /etc/sysconfig/iptables
  - Or, port forwarding: ssh root@stakkato.rocksclusters.org -L 8080:localhost:80
    Then point a web browser at http://localhost:8080

22 Frontend Web Page

23 SGE Job Monitoring


25 Ganglia Monitoring


27 Scaling Up Linpack

- Tell SGE to allocate more processors
  - Edit 'sge-qsub-test.sh' and change:
    #$ -pe mpi 2
    to:
    #$ -pe mpi 4
- Tell Linpack to use more processors
  - Edit 'HPL.dat' and change:
    1            Ps
    to:
    2            Ps
  - The number of processors Linpack uses is P * Q

28 Scaling Up Linpack

- Submit the larger job: $ qsub sge-qsub-test.sh
- To make Linpack use more memory (and increase performance), edit 'HPL.dat' and change:
  1000         Ns
  to:
  4000         Ns
  - Linpack operates on an N * N matrix
  - Goal: consume 75% of the memory on each compute node
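
The 75% goal can be turned into arithmetic: an N x N double-precision matrix needs 8 * N^2 bytes, so N ~ sqrt(0.75 * total_memory / 8). A small sketch (the node count and the 1 GiB-per-node figure are assumed example values, not measurements from this cluster):

```shell
# Pick the largest Ns that fills ~75% of aggregate memory with the
# 8-byte double-precision HPL matrix: N = sqrt(0.75 * mem_bytes / 8).
nodes=2
mem_per_node=$((1024 * 1024 * 1024))   # assumed 1 GiB per compute node
awk -v m=$((nodes * mem_per_node)) \
    'BEGIN { printf "Ns ~= %d\n", sqrt(0.75 * m / 8) }'
```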

29 Using Linpack Over Myrinet

- Get a test Myrinet SGE submission script:
  $ cp /var/www/html/rocks-documentation/3.2.0/examples/sge-qsub-test-myri.sh .
- Scale up the job in the same manner as described in the previous slides
- Submit the Myrinet-based job: $ qsub sge-qsub-test-myri.sh

30 Executing Commands Across the Cluster

- Collect 'ps' status: cluster-ps
  - To get the status of all the processes being executed by user 'bruno': cluster-ps bruno
- Kill processes: cluster-kill
  - To kill all the Linpack jobs: cluster-kill xhpl
- Execute any command-line executable: cluster-fork
  - To restart the 'autofs' service on all compute nodes: cluster-fork "service autofs restart"
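
Conceptually, cluster-fork is just a loop that runs the given command on every node; a minimal sketch of that idea (the node names are assumed, and this is not cluster-fork's actual implementation):

```shell
# Hypothetical re-implementation of the cluster-fork idea:
# iterate over a node list and run the same command on each.
nodes="compute-0-0 compute-0-1"
cmd="service autofs restart"
for n in $nodes; do
    # A real run would execute: ssh "$n" "$cmd"
    # echo keeps this sketch side-effect free.
    echo "$n: $cmd"
done
```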

31 Executing Commands Across the Cluster

- All cluster-* commands can query the database to generate a node list
  - To restart the 'autofs' service only on the nodes in cabinet 1:
    cluster-fork --query="select name from nodes where rack=1" "service autofs restart"

32 Compile Code

33 Compile Test MPI Program with gcc

Compile cpi:
$ cp /opt/mpich/gnu/examples/cpi.c .
$ cp /opt/mpich/gnu/examples/Makefile .
$ make cpi
/opt/mpich/gnu/bin/mpicc -c cpi.c
/opt/mpich/gnu/bin/mpicc -o cpi cpi.o -lm

Run it:
$ /opt/mpich/gnu/bin/mpirun -nolocal -np 2 -machinefile machines ./cpi
Process 0 on compute-2-1.local
Process 1 on compute-2-1.local
pi is approximately 3.1416009869231241, Error is 0.0000083333333309
wall clock time = 0.000650


35 Compile MPI Code with Intel Compiler

Simply change 'gnu' to 'intel':
$ cp /opt/mpich/intel/examples/cpi.c $HOME
$ cp /opt/mpich/intel/examples/Makefile $HOME
$ make cpi
/opt/mpich/intel/bin/mpicc -c cpi.c
/opt/mpich/intel/bin/mpicc -o cpi cpi.o -lm

36 Bring In Your Own Code FTP your code to the frontend Let’s compile and try to run it!

37 Compute Node Management

38 Adding a Compute Node

- Execute "insert-ethers"
- If adding to a specific rack:
  - For example, if adding to cabinet 2: insert-ethers --cabinet=2
- If adding to a specific location within a rack:
  - insert-ethers --cabinet=2 --rank=4

39 Replacing a Dead Node

- To replace node compute-0-4: # insert-ethers --replace="compute-0-4"
- Remove the dead node
- Power up the new node
- Put the new node into "installation mode"
  - Boot with the Rocks Base CD, PXE boot, etc.
- The next node that issues a DHCP request will assume the role of compute-0-4

40 Removing a Node

- If decommissioning a node: # insert-ethers --remove="compute-0-2"
- insert-ethers will remove all traces of compute-0-2 from the database and restart all relevant services
  - You will not be asked for any input

