High-Performance Computing at the Martinos Center

1 High-Performance Computing at the Martinos Center
Iman Aganj
Why & How
September 20, 2018

2 High-Performance Computing (HPC)
What? The use of advanced (usually shared) computing resources to solve large computational problems quickly and efficiently.
Why?
- Processing of large datasets in parallel.
- Access to remote resources that aren't locally available, e.g., big chunks of memory or GPUs.
How? Remote job submission.

3 What HPC resources are available to us?
Martinos Center Compute Cluster: Launchpad, Icepuffs
Partners Research Computing: ERISOne Linux Cluster, Windows Analysis Servers
MGH & BWH Center for Clinical Data Science
Harvard Medical School Research Computing: O2
External: Open Science Grid, Mass Open Cloud

4 Martinos Center Compute Cluster
Launchpad
Resources:
- 105 nodes, each with 8 cores (840 cores in total) and 56GB of memory
- GPUs: 7 × Tesla M2050
Job scheduler: PBS

5 Martinos Center Compute Cluster
Icepuffs
Resources:
- 3 Icepuff nodes, each with 64 cores and 256GB of memory
Pros (Launchpad & Icepuffs):
- NMR network folders are already mounted.
- Exclusive to Martinos members.
- Latest version of FreeSurfer is ready to use.

6 Partners Research Computing
ERISOne Linux Cluster
Resources:
- 380 nodes, each with up to 36 cores (~7,000 cores in total) and up to 512GB of memory
- A 3TB-memory server with 64 cores
- GPUs: 4 × Tesla P100 (+ new V100s), 24 × Tesla M2070
Job scheduler: LSF
Pros:
- Some directories are mounted on the NMR network.
- High-memory jobs (up to 498GB in the big queue).

7 Partners Research Computing Windows Analysis Servers
Resources:
- 2 Windows machines:
  HPCWin2 (32 cores, 256GB of memory)
  HPCWin3 (32 cores, 320GB of memory)
Connection using the Remote Desktop Protocol (see the sketch below):
rdesktop hpcwin3.research.partners.org
Use PARTNERS\PartnersID to log in.
Pros:
- Run Windows applications.
- Quick access to MS Office.
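As a rough sketch of a fuller connection command (only the bare rdesktop call above comes from the slide; -u, -d, and -g are standard rdesktop options, and the screen size is arbitrary):

# Connect to HPCWin3 with the PARTNERS domain account at 1600x900 resolution
rdesktop -d PARTNERS -u PartnersID -g 1600x900 hpcwin3.research.partners.org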

8 MGH & BWH Center for Clinical Data Science
Mechanism:
- Email the abstract of the project.
- Become their collaborative partner.
Resources:
- GPUs: NVIDIA Deep Learning boxes (DGX-1 with Tesla V100 GPUs), Tesla P100 GPUs
- Dedicated clusters
Pros:
- Fastest existing GPUs.
- Perfect for deep learning.

9 Harvard Medical School Research Computing
Resources:
- 268 nodes, each with up to 32 cores (8,064 cores in total, with more coming soon) and 256GB of memory
- Soon: 10 high-memory nodes with 768GB of memory
- GPUs: 8 × Tesla M40 (4 × 24GB, 4 × 12GB), 16 × Tesla K80 (12GB each); soon: 16 × Tesla V100 with NVLink
Job scheduler: Slurm (see the sketch below)
Pros:
- Available to both quad & non-quad HMS affiliates (and their RAs).
- Often underused and not congested.
- Many Matlab licenses with most toolboxes, including Matlab Distributed Computing Server.
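Since O2 uses Slurm rather than PBS, jobs are submitted with sbatch instead of pbsubmit. A minimal sketch, assuming a partition named short exists and that myjob.sh is your own script (both names are assumptions, not taken from the slide):

#!/bin/bash
#SBATCH --partition=short        # assumed partition name
#SBATCH --cpus-per-task=4        # request 4 cores
#SBATCH --mem=16G                # request 16GB of memory
#SBATCH --time=12:00:00          # 12-hour wall-time limit
#SBATCH --output=myjob_%j.out    # per-job output file
./myjob.sh

Save this as, e.g., slurm_job.sh, submit it with "sbatch slurm_job.sh", and check its status with "squeue -u $USER".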

10 Launchpad

11 Getting Started with Launchpad
Request access:
Log in to Launchpad:
ssh launchpad
Need help? Read the documentation:

12 Submitting and Checking the Status of a Job
pbsubmit -c "echo Started; sleep 30; echo Finished"
Opening pbsjob_2
qsub -V -S /bin/sh -l nodes=1:ppn=1,vmem=7gb -r n /pbs/iman/pbsjob_2
launchpad.nmr.mgh.harvard.edu

qstat -u iman
launchpad.nmr.mgh.harvard.edu:
                                                          Req'd   Req'd    Elap
Job ID   Username  Queue     Jobname   SessID  NDS  TSK   Memory  Time  S  Time
launchp  iman      e_defaul  pbsjob_                              :00   Q  --
launchp  iman      e_defaul  pbsjob_                              :00   R  --
The slide's annotations point out the Job Name, Job ID, and Status (S) columns.
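In practice, the command in quotes is usually your actual processing step; for example, a FreeSurfer reconstruction of an already-imported subject could be submitted as a single job (the subject name below is hypothetical):

# Submit a FreeSurfer recon-all run as one Launchpad job (hypothetical subject name)
pbsubmit -c "recon-all -s subj01 -all"
# Confirm that it is queued (Q) or running (R)
qstat -u $USER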

13 Viewing the Output of the Job
jobinfo -o
Started

qstat -u iman

jobinfo
JOB INFO FOR :
Queued on 09/20/ :26:10
Started on 09/20/ :30:51
Ended on 09/20/ :31:21
Run on host compute-0-57
User is iman
Cputime: 00:00:00
Walltime: 00:00:30 (Limit: 96:00:00)
Resident Memory: 3520kb
Virtual Memory: kb (Limit: 7gb)
Exit status: 0

cat /pbs/iman/pbsjob_2.o
Finished

14 Cancelling a Job
Cancel a specific job:
qdel <job ID>
Cancel all my jobs:
qselect -u iman | xargs qdel
qdel all
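If you only want to remove jobs that have not started yet, qselect can filter by state (the -s option is a standard PBS qselect flag; it is not shown on the slide):

# Delete only my jobs that are still queued (state Q), leaving running jobs alone
qselect -u $USER -s Q | xargs qdel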

15 Requesting More Resources
1 core ~ 7GB of memory.
Request 2 cores and 14GB of memory:
pbsubmit -n 2 -c "echo Test."
Opening pbsjob_3
qsub -V -S /bin/sh -l nodes=1:ppn=2,vmem=14gb -r n /pbs/iman/pbsjob_3
launchpad.nmr.mgh.harvard.edu

Request 8 days of wall time (instead of the default 4 days):
pbsubmit -q extended -c "echo Test."
qstat -u iman
launchpad.nmr.mgh.harvard.edu:
                                                          Req'd   Req'd    Elap
Job ID   Username  Queue     Jobname   SessID  NDS  TSK   Memory  Time  S  Time
launchp  iman      e_extend  pbsjob_                              :0    Q  --
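The one-core-per-7GB rule generalizes: a long-running job that needs roughly 28GB could request 4 cores on the extended queue. Combining -n with -q in a single call is an assumption based on the two separate examples above, and the script name is hypothetical:

# ~28GB of memory (4 cores) with an 8-day wall-time limit
pbsubmit -n 4 -q extended -c "./long_job.sh"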

16 Queues
pbsubmit -q max100 -c "echo Test."

Queue Name   Starting Priority   Max CPU Slots
default      10000               150
max500       8800                500
max200       9400                200
max100                           100
max75                            75
max50                            50
max20                            20
max10                            10
p5
p10
p20          10300
p30          10600
p40          10900
p50          11200
p60          11500

17 Queues: GPU
pbsubmit -q GPU -c "jobGPU"
Opening pbsjob_9
qsub -V -S /bin/sh -q GPU -l nodes=1:GPU:ppn=5 -r n /pbs/iman/pbsjob_9
launchpad.nmr.mgh.harvard.edu

jobinfo pbsjob_9
JOB INFO FOR :
Queued on 09/21/ :56:03
Started on 09/21/ :56:20
Ended on
Run on host compute-0-80
User is iman
Cputime:
Walltime: (Limit: )
Resident Memory:
Virtual Memory: (Limit: )
Exit status:

18 Queues: GPU
ssh compute-0-80
Last login: Tue Sep 13 20:34 from launchpad.nmr.mgh.harvard.edu

top
top - 13:57:21 up 395 days, 17:12, 1 user, load average: 0.99, 0.27, 0.09
Tasks: 254 total, 1 running, 253 sleeping, 0 stopped, 0 zombie
Cpu(s): 7.2%us, 5.3%sy, 0.0%ni, 87.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
The job's MATLAB process (PID 30953, user iman) appears in the process list; the remaining entries are system processes (init, kthreadd, migration, ksoftirqd, watchdog, ...).

19 Queues: GPU
ssh compute-0-80
nvidia-smi
The nvidia-smi table shows GPU 0, a Tesla M2050 with 2687MiB of memory; in its process list, the job's MATLAB process (...ofs/cluster/matlab/8.6/bin/glnxa64/MATLAB) is using 1620MiB of GPU memory.
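Instead of ssh-ing to the compute node, the same information can be captured from inside the job itself by logging the GPU state right before the GPU program starts (jobGPU stands for your own GPU executable, as on slide 17):

# Record the GPU's state in the job's output file, then run the GPU program
pbsubmit -q GPU -c "nvidia-smi; jobGPU"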

20 Queues: highio
pbsubmit -q highio -c "jobHighIO"
To deal with the I/O bottleneck:
- During the job's lifetime, keep the data in /cluster.
- Create the temporary files in /cluster/scratch.
- Submit multiple jobs with some delay in between, e.g. by interleaving the sleep command between the job submission commands (see the sketch below).
- Use the highio queue so there are no more than a total of 20 jobs with high I/O running on Launchpad.
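A minimal sketch of the staggered-submission idea, assuming one I/O-heavy script per subject (process_subject.sh, the subject names, and the 60-second delay are all hypothetical):

# Submit one high-I/O job per subject, waiting a minute between submissions
for subj in subj01 subj02 subj03; do
    pbsubmit -q highio -c "process_subject.sh $subj"
    sleep 60
done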

21 Interactive Jobs
qsub -I -V -X -q p60
qsub: waiting for job launchpad.nmr.mgh.harvard.edu to start
qsub: job launchpad.nmr.mgh.harvard.edu ready
hostname
compute-0-6
...
exit
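With -X, graphical programs started inside the session display on your local screen; for example, Matlab could be run interactively on the compute node (this particular use is an illustration, not shown on the slide):

qsub -I -V -X -q p60    # interactive session with X11 forwarding
matlab                  # runs on the compute node, GUI displayed locally
exit                    # release the node when done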

22 Email Notifications
pbsubmit -m MartinosID -c "echo Test."

pbsubmit -m iman -c "echo Started; sleep 30; echo Finished"
Opening pbsjob_12
qsub -V -S /bin/sh -m abe -M iman -l nodes=1:ppn=1,vmem=7gb -r n /pbs/iman/pbsjob_12
launchpad.nmr.mgh.harvard.edu

Email received when the job started running:
PBS Job Id: launchpad.nmr.mgh.harvard.edu
Job Name: pbsjob_12
Exec host: compute-0-37/6
Begun execution

Email received when the job ended:
Execution terminated
Exit_status=0
resources_used.cput=00:00:00
resources_used.mem=3516kb
resources_used.vmem=321640kb
resources_used.walltime=00:00:30

23 Using Matlab on Launchpad
Matlab licenses are limited! Compile your Matlab code so that you can run it without a license:
- Use the mcc command in Matlab.
- See JP Coutu's guide to using Matlab's deploytool.
- Submit the job to run the compiled executable file (see the sketch below).
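A minimal sketch of the compile-then-submit workflow (myscript.m and its argument are hypothetical; mcc -m is the standard Matlab command for building a standalone executable, and run_myscript.sh is the wrapper script that mcc generates):

# Compile myscript.m without opening the Matlab GUI (uses a compiler license once)
matlab -nodisplay -r "mcc -m myscript.m; exit"
# Submit the compiled program; the wrapper takes the Matlab/MCR root directory as its first argument
pbsubmit -c "./run_myscript.sh <Matlab root> subj01"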

24 Thank You!
If you use Launchpad in your research, please cite the "NIH Instrumentation Grants 1S10RR023401, 1S10RR019307, and 1S10RR023043" in your publication.

