High-Performance Computing at the Martinos Center: Why & How
  Iman Aganj
  September 20, 2018
High-Performance Computing (HPC)
  What?
    The use of advanced (usually shared) computing resources to solve large computational problems quickly and efficiently.
  Why?
    - Processing of large datasets in parallel.
    - Access to remote resources that aren’t locally available, e.g.:
        - Big chunks of memory
        - GPUs
  How?
    Remote job submission.
What HPC resources are available to us?
  Martinos Center Compute Cluster:
    - Launchpad
    - Icepuffs
  Partners Research Computing:
    - ERISOne Linux Cluster
    - Windows Analysis Servers
  MGH & BWH Center for Clinical Data Science
  Harvard Medical School Research Computing:
    - O2
  External:
    - Open Science Grid
    - Mass Open Cloud
Martinos Center Compute Cluster: Launchpad
  Resources:
    - 105 nodes, each with:
        - 8 cores (total of ~840 cores)
        - 56GB of memory
    - GPUs: 7 × Tesla M2050
  Job scheduler: PBS
  www.nmr.mgh.harvard.edu/martinos/userInfo/computer/launchpad.php
Martinos Center Compute Cluster: Icepuffs
  Resources:
    - 3 Icepuff nodes, each with:
        - 64 cores
        - 256GB of memory
  Pros (Launchpad & Icepuffs):
    - NMR network folders are already mounted.
    - Exclusive to Martinos members.
    - Latest version of FreeSurfer is ready to use.
  www.nmr.mgh.harvard.edu/martinos/userInfo/computer/icepuffs.php
Partners Research Computing: ERISOne Linux Cluster
  Resources:
    - 380 nodes, each with:
        - up to 36 cores (total of ~7000 cores)
        - up to 512GB of memory
    - A 3TB-memory server with 64 cores
    - GPUs:
        - 4 × Tesla P100 (+ new V100s)
        - 24 × Tesla M2070
  Job scheduler: LSF
  Pros:
    - Some directories are mounted on the NMR network.
    - High-memory jobs (up to 498GB in the big queue).
  https://rc.partners.org/kb/article/2164
Partners Research Computing: Windows Analysis Servers
  Resources:
    - 2 Windows machines:
        - HPCWin2 (32 cores, 256GB of memory)
        - HPCWin3 (32 cores, 320GB of memory)
  Connection using the Remote Desktop Protocol:
    rdesktop hpcwin3.research.partners.org
    Use PARTNERS\PartnersID to log in.
  Pros:
    - Run Windows applications.
    - Quick access to MS Office.
  https://rc.partners.org/kb/computational-resources/windows-analysis-servers?article=2652
MGH & BWH Center for Clinical Data Science
  Mechanism:
    - Email the abstract of the project.
    - Become their collaborative partner.
  Resources:
    - GPUs:
        - NVIDIA Deep Learning boxes (DGX-1 with Tesla V100 GPUs)
        - Tesla P100 GPUs
    - Dedicated clusters
  Pros:
    - Fastest existing GPUs.
    - Perfect for deep learning.
  www.ccds.io
Harvard Medical School Research Computing: O2
  Resources:
    - 268 nodes, each with:
        - up to 32 cores (total of 8064 cores, soon: 11000 cores)
        - 256GB of memory
    - Soon: 10 high-memory nodes with 768GB of memory
    - GPUs:
        - 8 × Tesla M40 (4 × 24GB, 4 × 12GB)
        - 16 × Tesla K80 (12GB each)
        - Soon: 16 × Tesla V100 with NVLink
  Job scheduler: Slurm
  Pros:
    - Available to both quad & non-quad HMS affiliates (and their RAs).
    - Often underused and not congested.
    - Many Matlab licenses with most toolboxes, including Matlab Distributed Computing Server.
  https://wiki.rc.hms.harvard.edu/display/O2
Launchpad www.nmr.mgh.harvard.edu/martinos/userInfo/computer/launchpad.php
Getting Started with Launchpad
  Request access:
    Email: help@nmr.mgh.harvard.edu
  Log in to Launchpad:
    ssh launchpad
  Need help?
    - Read the documentation: www.nmr.mgh.harvard.edu/martinos/userInfo/computer/launchpad.php
    - Email: batch-users@nmr.mgh.harvard.edu
Submitting and Checking the Status of a Job
  Submit a job:
    pbsubmit -c "echo Started; sleep 30; echo Finished"
        Opening pbsjob_2
        qsub -V -S /bin/sh -l nodes=1:ppn=1,vmem=7gb -r n /pbs/iman/pbsjob_2
        14779540.launchpad.nmr.mgh.harvard.edu
  Here pbsjob_2 is the job name and 14779540 is the job ID.
  Check the status (S column: Q while queued, then R once running):
    qstat -u iman
        launchpad.nmr.mgh.harvard.edu:
                                                                                 Req'd  Req'd   Elap
        Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
        -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
        14779540.launchp     iman     e_defaul pbsjob_2         --     1     1   --     96:00 Q --
        14779540.launchp     iman     e_defaul pbsjob_2         3472   1     1   --     96:00 R --
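For scripting, the job ID and the status can be pulled out of the command output shown on this slide. The following is a minimal sketch, assuming the output layout matches the example above and that pbsubmit writes it to standard output; extract_job_id and job_state are hypothetical helper names, not part of the Launchpad tools.

```shell
# Hypothetical helpers; they assume the output formats shown on this slide.

# Read pbsubmit output on stdin and print the numeric job ID, taken from
# the line that looks like "14779540.launchpad.nmr.mgh.harvard.edu".
extract_job_id() {
    sed -n 's/^\([0-9][0-9]*\)\.launchpad\..*$/\1/p' | head -n 1
}

# Read "qstat -u USER" output on stdin and print the S (status) column
# for the job whose ID is given as $1. If the job appears on several
# rows, the status on the last matching row is printed.
job_state() {
    awk -v id="$1" '
        index($1, id ".") == 1 { state = $10 }
        END { if (state != "") print state }
    '
}

# Possible usage on Launchpad (not run here):
#   id=$(pbsubmit -c "echo Started; sleep 30; echo Finished" | extract_job_id)
#   qstat -u iman | job_state "$id"
```

Parsing by column position is fragile: if the qstat layout differs from the one above, the field number in job_state would need adjusting.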
Viewing the Output of the Job
    jobinfo -o 14779540
        Started
    qstat -u iman
    jobinfo 14779540
        JOB INFO FOR 14779540:
        Queued on 09/20/2017 18:26:10
        Started on 09/20/2017 18:30:51
        Ended on 09/20/2017 18:31:21
        Run on host compute-0-57
        User is iman
        Cputime: 00:00:00
        Walltime: 00:00:30 (Limit: 96:00:00)
        Resident Memory: 3520kb
        Virtual Memory: 321640kb (Limit: 7gb)
        Exit status: 0
    cat /pbs/iman/pbsjob_2.o14779540
        Finished
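A script can check whether a job succeeded by reading the "Exit status" field of the jobinfo summary. This is a rough sketch, assuming each field is printed on its own line as in the example above; job_exit_status is a hypothetical helper name.

```shell
# Read "jobinfo JOBID" output on stdin and print the exit status.
# Prints nothing if the field is empty, as it is in the jobinfo output
# of a job that has not ended yet.
job_exit_status() {
    sed -n 's/^[[:space:]]*Exit status:[[:space:]]*\(-\{0,1\}[0-9][0-9]*\).*$/\1/p'
}

# Possible usage on Launchpad (not run here):
#   status=$(jobinfo 14779540 | job_exit_status)
#   [ "$status" = "0" ] && echo "Job succeeded"
```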
Cancelling a Job
  Cancel a specific job:
    qdel 14779540
  Cancel all my jobs:
    qselect -u iman | xargs qdel
  or:
    qdel all
Requesting More Resources
  1 core ~ 7GB of memory
  Request 2 cores and 14GB of memory:
    pbsubmit -n 2 -c "echo Test."
        Opening pbsjob_3
        qsub -V -S /bin/sh -l nodes=1:ppn=2,vmem=14gb -r n /pbs/iman/pbsjob_3
        14783620.launchpad.nmr.mgh.harvard.edu
  Request 8 days of wall time (instead of the default 4 days):
    pbsubmit -q extended -c "echo Test."
    qstat -u iman
        launchpad.nmr.mgh.harvard.edu:
                                                                                 Req'd  Req'd   Elap
        Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
        -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
        14783622.launchp     iman     e_extend pbsjob_4         --     1     1   --     196:0 Q --
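Because memory is tied to cores (about 7GB per core), the number of cores to request follows from the memory a job needs. Below is a small sketch of that arithmetic, assuming the 7GB-per-core ratio holds exactly; cores_for_mem is a hypothetical helper name.

```shell
# Print the number of cores to request (the pbsubmit -n value) for a job
# that needs $1 GB of memory, assuming 7GB of memory per core.
cores_for_mem() {
    mem_gb=$1
    per_core_gb=7
    # Integer ceiling division: round up to a whole number of cores.
    echo $(( (mem_gb + per_core_gb - 1) / per_core_gb ))
}

# Example: a 20GB job needs 3 cores (21GB), since 2 cores give only 14GB.
# Possible usage on Launchpad (not run here):
#   pbsubmit -n "$(cores_for_mem 20)" -c "echo Test."
```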
Queues
    pbsubmit -q max100 -c "echo Test."

    Queue Name   Starting Priority   Max CPU Slots
    default      10000               150
    max500       8800                500
    max200       9400                200
    max100                           100
    max75                            75
    max50                            50
    max20                            20
    max10                            10
    p5
    p10
    p20          10300
    p30          10600
    p40          10900
    p50          11200
    p60          11500
Queues: GPU
    pbsubmit -q GPU -c "jobGPU"
        Opening pbsjob_9
        qsub -V -S /bin/sh -q GPU -l nodes=1:GPU:ppn=5 -r n /pbs/iman/pbsjob_9
        14783690.launchpad.nmr.mgh.harvard.edu
    jobinfo pbsjob_9
        JOB INFO FOR 14783690:
        Queued on 09/21/2017 13:56:03
        Started on 09/21/2017 13:56:20
        Ended on
        Run on host compute-0-80
        User is iman
        Cputime:
        Walltime: (Limit: )
        Resident Memory:
        Virtual Memory: (Limit: )
        Exit status:
Queues: GPU
    ssh compute-0-80
        Last login: Tue Sep 13 20:34:39 2016 from launchpad.nmr.mgh.harvard.edu
    top
        top - 13:57:21 up 395 days, 17:12,  1 user,  load average: 0.99, 0.27, 0.09
        Tasks: 254 total,   1 running, 253 sleeping,   0 stopped,   0 zombie
        Cpu(s):  7.2%us,  5.3%sy,  0.0%ni, 87.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
        Mem:  32879188k total,  5153368k used, 27725820k free,   269984k buffers
        Swap: 67108860k total,        0k used, 67108860k free,  3756420k cached

          PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
        30953 iman      20   0 47.5g 544m 147m S 99.7  1.7   0:54.41 MATLAB
            1 root      20   0 25676 1672 1324 S  0.0  0.0   0:13.28 init
            2 root      20   0     0    0    0 S  0.0  0.0   0:03.25 kthreadd
            3 root      RT   0     0    0    0 S  0.0  0.0   8:03.52 migration/0
            4 root      20   0     0    0    0 S  0.0  0.0   0:07.81 ksoftirqd/0
            5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 stopper/0
            6 root      RT   0     0    0    0 S  0.0  0.0   0:22.12 watchdog/0
            7 root      RT   0     0    0    0 S  0.0  0.0   0:02.34 migration/1
            8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 stopper/1
            9 root      20   0     0    0    0 S  0.0  0.0   0:02.05 ksoftirqd/1
           10 root      RT   0     0    0    0 S  0.0  0.0   0:09.43 watchdog/1
           11 root      RT   0     0    0    0 S  0.0  0.0   0:17.38 migration/2
Queues: GPU
    ssh compute-0-80
        Last login: Tue Sep 13 20:34:39 2016 from launchpad.nmr.mgh.harvard.edu
    nvidia-smi
        Thu Sep 21 13:57:01 2017
        +------------------------------------------------------+
        | NVIDIA-SMI 361.28     Driver Version: 361.28         |
        |-------------------------------+----------------------+----------------------+
        | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
        | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
        |===============================+======================+======================|
        |   0  Tesla M2050         Off  | 0000:0B:00.0     Off |                    0 |
        | N/A   N/A    P0    N/A /  N/A |   1628MiB /  2687MiB |     99%      Default |
        +-------------------------------+----------------------+----------------------+

        +-----------------------------------------------------------------------------+
        | Processes:                                                       GPU Memory |
        |  GPU       PID  Type  Process name                               Usage      |
        |=============================================================================|
        |    0     30953    C   ...ofs/cluster/matlab/8.6/bin/glnxa64/MATLAB  1620MiB |
Queues: highio
    pbsubmit -q highio -c "jobHighIO"
  To deal with the I/O bottleneck:
    - During the job’s lifetime, keep the data in: /cluster
    - Create the temporary files in: /cluster/scratch
    - Submit multiple jobs with some delay in between, e.g., by interleaving the sleep command between the job submission commands.
    - Use the highio queue so that no more than a total of 20 high-I/O jobs run on Launchpad at once.
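The advice to interleave sleep between submissions can be sketched as a small loop. This is only an illustration under stated assumptions: the 30-second default delay is an arbitrary value, submit_staggered is a hypothetical helper name, and the SUBMIT variable exists only so the loop can be tried as a dry run (e.g., SUBMIT="echo pbsubmit" DELAY=0) without touching the cluster.

```shell
# Submission command and delay; both can be overridden for a dry run.
SUBMIT=${SUBMIT:-pbsubmit}
DELAY=${DELAY:-30}   # seconds between submissions (an assumed value)

# Submit one highio job per argument, sleeping between submissions.
# "jobHighIO" stands for your own I/O-heavy command, as on this slide.
submit_staggered() {
    first=1
    for item in "$@"; do
        # Sleep before every submission except the first.
        if [ "$first" -eq 0 ]; then
            sleep "$DELAY"
        fi
        first=0
        $SUBMIT -q highio -c "jobHighIO $item"
    done
}

# Possible usage on Launchpad (not run here):
#   submit_staggered subj01 subj02 subj03
```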
Interactive Jobs
    qsub -I -V -X -q p60
        qsub: waiting for job 14783853.launchpad.nmr.mgh.harvard.edu to start
        qsub: job 14783853.launchpad.nmr.mgh.harvard.edu ready
    hostname
        compute-0-6
    .
    exit
Email Notifications
    pbsubmit -m MartinosID -c "echo Test."
  Example:
    pbsubmit -m iman -c "echo Started; sleep 30; echo Finished"
        Opening pbsjob_12
        qsub -V -S /bin/sh -m abe -M iman -l nodes=1:ppn=1,vmem=7gb -r n /pbs/iman/pbsjob_12
        14783855.launchpad.nmr.mgh.harvard.edu
  Email received when the job started running:
        PBS Job Id: 14783855.launchpad.nmr.mgh.harvard.edu
        Job Name: pbsjob_12
        Exec host: compute-0-37/6
        Begun execution
  Email received when the job ended:
        Execution terminated
        Exit_status=0
        resources_used.cput=00:00:00
        resources_used.mem=3516kb
        resources_used.vmem=321640kb
        resources_used.walltime=00:00:30
Using Matlab on Launchpad
  Matlab licenses are limited!
  Compile your Matlab code so you can run it without a license:
    - Use the mcc command in Matlab.
    - See JP Coutu’s guide to using Matlab’s deploytool: http://nmr.mgh.harvard.edu/martinos/itgroup/deploytool.html
    - Submit the job to run the compiled executable file.
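The compile-then-submit workflow might look roughly like the sketch below. Everything named here is an assumption for illustration: my_analysis.m is a hypothetical script, the Matlab or runtime directory must be supplied by you, and submit_compiled is a hypothetical helper. It relies on mcc -m producing a standalone executable together with a run_<name>.sh wrapper on Linux that takes the Matlab or runtime directory as its first argument; consult the deploytool guide above for the setup actually used at the Center.

```shell
# Submission command; override (e.g., SUBMIT="echo pbsubmit") for a dry run.
SUBMIT=${SUBMIT:-pbsubmit}

# Step 1 (done once, inside Matlab; this step uses a license):
#   mcc -m my_analysis.m
# On Linux this produces the executable my_analysis and the wrapper
# script run_my_analysis.sh (my_analysis.m is a hypothetical name).

# Step 2: submit the compiled program as a job.
# $1 = Matlab (or Matlab runtime) installation directory, an assumption
#      you must fill in; remaining arguments are passed to the program.
submit_compiled() {
    matlab_root=$1
    shift
    $SUBMIT -c "./run_my_analysis.sh $matlab_root $*"
}

# Possible usage on Launchpad (not run here; the path is a placeholder):
#   submit_compiled /path/to/matlab_root subj01
```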
Thank You!
  If you use Launchpad in your research, please cite the “NIH Instrumentation Grants 1S10RR023401, 1S10RR019307, and 1S10RR023043” in your publication.