High-Performance Computing at the Martinos Center


High-Performance Computing at the Martinos Center
Why & How
Iman Aganj
September 20, 2018

High-Performance Computing (HPC)

What?
- The use of advanced (usually shared) computing resources to solve large computational problems quickly and efficiently.
Why?
- Processing of large datasets in parallel.
- Access to remote resources that aren’t locally available, e.g.:
    - Big chunks of memory
    - GPUs
How?
- Remote job submission.

What HPC resources are available to us?

- Martinos Center Compute Cluster:
    - Launchpad
    - Icepuffs
- Partners Research Computing:
    - ERISOne Linux Cluster
    - Windows Analysis Servers
- MGH & BWH Center for Clinical Data Science
- Harvard Medical School Research Computing:
    - O2
- External:
    - Open Science Grid
    - Mass Open Cloud

Martinos Center Compute Cluster: Launchpad

Resources:
- 105 nodes, each with:
    - 8 cores (total of ~840 cores)
    - 56GB of memory
- GPUs: 7 × Tesla M2050
Job scheduler: PBS
www.nmr.mgh.harvard.edu/martinos/userInfo/computer/launchpad.php

Martinos Center Compute Cluster: Icepuffs

Resources:
- 3 Icepuff nodes, each with:
    - 64 cores
    - 256GB of memory
Pros (Launchpad & Icepuffs):
- NMR network folders are already mounted.
- Exclusive to Martinos members.
- Latest version of FreeSurfer is ready to use.
www.nmr.mgh.harvard.edu/martinos/userInfo/computer/icepuffs.php

Partners Research Computing: ERISOne Linux Cluster

Resources:
- 380 nodes, each with:
    - up to 36 cores (total of ~7000 cores)
    - up to 512GB of memory
- A 3TB-memory server with 64 cores
- GPUs:
    - 4 × Tesla P100 (+ new V100s)
    - 24 × Tesla M2070
Job scheduler: LSF
Pros:
- Some directories are mounted on the NMR network.
- High-memory jobs (up to 498GB in the big queue).
https://rc.partners.org/kb/article/2164

Partners Research Computing: Windows Analysis Servers

Resources:
- 2 Windows machines:
    - HPCWin2 (32 cores, 256GB of memory)
    - HPCWin3 (32 cores, 320GB of memory)
Connection using the Remote Desktop Protocol:
    rdesktop hpcwin3.research.partners.org
- Use PARTNERS\PartnersID to log in.
Pros:
- Run Windows applications.
- Quick access to MS Office.
https://rc.partners.org/kb/computational-resources/windows-analysis-servers?article=2652

MGH & BWH Center for Clinical Data Science

Mechanism:
- Email the abstract of the project.
- Become their collaborative partner.
Resources:
- GPUs:
    - NVIDIA Deep Learning boxes (DGX-1 with Tesla V100 GPUs)
    - Tesla P100 GPUs
- Dedicated clusters
Pros:
- Fastest existing GPUs.
- Perfect for deep learning.
www.ccds.io

Harvard Medical School Research Computing: O2

Resources:
- 268 nodes, each with:
    - up to 32 cores (total of 8064 cores, soon: 11000 cores)
    - 256GB of memory
- Soon: 10 high-memory nodes with 768GB of memory
- GPUs:
    - 8 × Tesla M40 (4 × 24GB, 4 × 12GB)
    - 16 × Tesla K80 (12GB each)
    - Soon: 16 × Tesla V100 with NVLink
Job scheduler: Slurm
Pros:
- Available to both quad & non-quad HMS affiliates (and their RAs).
- Often underused and not congested.
- Many Matlab licenses with most toolboxes, including Matlab Distributed Computing Server.
https://wiki.rc.hms.harvard.edu/display/O2

Launchpad www.nmr.mgh.harvard.edu/martinos/userInfo/computer/launchpad.php

Getting Started with Launchpad

Request access:
- Email: help@nmr.mgh.harvard.edu
Log in to Launchpad:
    ssh launchpad
Need help?
- Read the documentation: www.nmr.mgh.harvard.edu/martinos/userInfo/computer/launchpad.php
- Email: batch-users@nmr.mgh.harvard.edu

Submitting and Checking the Status of a Job

    pbsubmit -c "echo Started; sleep 30; echo Finished"
    Opening pbsjob_2
    qsub -V -S /bin/sh -l nodes=1:ppn=1,vmem=7gb -r n /pbs/iman/pbsjob_2
    14779540.launchpad.nmr.mgh.harvard.edu

    qstat -u iman

    launchpad.nmr.mgh.harvard.edu:
                                                                             Req'd  Req'd   Elap
    Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
    -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
    14779540.launchp     iman     e_defaul pbsjob_2             --     1   1     -- 96:00 Q    --

The same row once the job is running:

    14779540.launchp     iman     e_defaul pbsjob_2           3472     1   1     -- 96:00 R    --

Callouts: Job Name (pbsjob_2), Job ID (14779540), Status (the S column: Q while queued, R while running).

Viewing the Output of the Job

    jobinfo -o 14779540
    Started

    qstat -u iman

    jobinfo 14779540
    JOB INFO FOR 14779540:
        Queued on 09/20/2017 18:26:10
        Started on 09/20/2017 18:30:51
        Ended on 09/20/2017 18:31:21
        Run on host compute-0-57
        User is iman
        Cputime: 00:00:00
        Walltime: 00:00:30 (Limit: 96:00:00)
        Resident Memory: 3520kb
        Virtual Memory: 321640kb (Limit: 7gb)
        Exit status: 0

    cat /pbs/iman/pbsjob_2.o14779540
    Finished
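In the transcript, qsub prints the full job ID (14779540.launchpad.nmr.mgh.harvard.edu), while jobinfo and qdel are given only the numeric part. A minimal sketch of stripping the host suffix in a script follows; the helper name job_number is hypothetical and not part of the Launchpad tools.

```shell
#!/bin/sh
# job_number FULL_ID: print the part of a PBS job ID before the first dot.
# (Hypothetical helper for illustration; not a Launchpad command.)
job_number() {
    # "%%.*" removes the longest suffix that starts with a dot.
    printf '%s\n' "${1%%.*}"
}

# Example using the job ID shown in the transcript.
job_number "14779540.launchpad.nmr.mgh.harvard.edu"
```

The result could then be passed on, for example as jobinfo "$(job_number "$full_id")", assuming full_id holds the line printed by qsub.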

Cancelling a Job

Cancel a specific job:
    qdel 14779540
Cancel all my jobs:
    qselect -u iman | xargs qdel
    qdel all

Requesting More Resources

1 core ~ 7GB of memory

Request 2 cores and 14GB of memory:
    pbsubmit -n 2 -c "echo Test."
    Opening pbsjob_3
    qsub -V -S /bin/sh -l nodes=1:ppn=2,vmem=14gb -r n /pbs/iman/pbsjob_3
    14783620.launchpad.nmr.mgh.harvard.edu

Request 8 days of wall time (instead of the default 4 days):
    pbsubmit -q extended -c "echo Test."
    qstat -u iman

    launchpad.nmr.mgh.harvard.edu:
                                                                             Req'd  Req'd   Elap
    Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
    -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
    14783622.launchp     iman     e_extend pbsjob_4             --     1   1     -- 196:0 Q    --
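Following the rule of thumb above (about 7GB of memory per requested core), the core count to request for a given memory need can be estimated with a ceiling division. This is only a sketch: it assumes memory scales linearly with the -n value, as in the two submissions shown (1 core with 7gb, 2 cores with 14gb), and the helper name cores_for_mem is hypothetical.

```shell
#!/bin/sh
# Rule of thumb from the slide: each requested core comes with about 7 GB.
GB_PER_CORE=7

# cores_for_mem GB: print the smallest core count whose memory share covers GB.
# (Hypothetical helper for illustration; not a Launchpad command.)
cores_for_mem() {
    # Integer ceiling division: (GB + 7 - 1) / 7
    echo $(( ($1 + GB_PER_CORE - 1) / GB_PER_CORE ))
}

# Print (without running it) the pbsubmit line for a job that needs 20 GB.
printf 'pbsubmit -n %s -c "echo Test."\n' "$(cores_for_mem 20)"
```

The block only prints the command, so it can be checked before anything is submitted.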

Queues

    pbsubmit -q max100 -c "echo Test."

    Queue Name   Starting Priority   Max CPU Slots
    default      10000               150
    max500       8800                500
    max200       9400                200
    max100                           100
    max75                            75
    max50                            50
    max20                            20
    max10                            10
    p5
    p10
    p20          10300
    p30          10600
    p40          10900
    p50          11200
    p60          11500

Queues: GPU

    pbsubmit -q GPU -c "jobGPU"
    Opening pbsjob_9
    qsub -V -S /bin/sh -q GPU -l nodes=1:GPU:ppn=5 -r n /pbs/iman/pbsjob_9
    14783690.launchpad.nmr.mgh.harvard.edu

    jobinfo pbsjob_9
    JOB INFO FOR 14783690:
        Queued on 09/21/2017 13:56:03
        Started on 09/21/2017 13:56:20
        Ended on
        Run on host compute-0-80
        User is iman
        Cputime:
        Walltime: (Limit: )
        Resident Memory:
        Virtual Memory: (Limit: )
        Exit status:

Queues: GPU

    ssh compute-0-80
    Last login: Tue Sep 13 20:34:39 2016 from launchpad.nmr.mgh.harvard.edu

    top
    top - 13:57:21 up 395 days, 17:12,  1 user,  load average: 0.99, 0.27, 0.09
    Tasks: 254 total,   1 running, 253 sleeping,   0 stopped,   0 zombie
    Cpu(s):  7.2%us,  5.3%sy,  0.0%ni, 87.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:  32879188k total,  5153368k used, 27725820k free,   269984k buffers
    Swap: 67108860k total,        0k used, 67108860k free,  3756420k cached

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    30953 iman      20   0 47.5g 544m 147m S 99.7  1.7   0:54.41 MATLAB
        1 root      20   0 25676 1672 1324 S  0.0  0.0   0:13.28 init
        2 root      20   0     0    0    0 S  0.0  0.0   0:03.25 kthreadd
        3 root      RT   0     0    0    0 S  0.0  0.0   8:03.52 migration/0
        4 root      20   0     0    0    0 S  0.0  0.0   0:07.81 ksoftirqd/0
        5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 stopper/0
        6 root      RT   0     0    0    0 S  0.0  0.0   0:22.12 watchdog/0
        7 root      RT   0     0    0    0 S  0.0  0.0   0:02.34 migration/1
        8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 stopper/1
        9 root      20   0     0    0    0 S  0.0  0.0   0:02.05 ksoftirqd/1
       10 root      RT   0     0    0    0 S  0.0  0.0   0:09.43 watchdog/1
       11 root      RT   0     0    0    0 S  0.0  0.0   0:17.38 migration/2

Queues: GPU

    ssh compute-0-80
    Last login: Tue Sep 13 20:34:39 2016 from launchpad.nmr.mgh.harvard.edu

    nvidia-smi
    Thu Sep 21 13:57:01 2017
    +------------------------------------------------------+
    | NVIDIA-SMI 361.28     Driver Version: 361.28         |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla M2050         Off  | 0000:0B:00.0     Off |                    0 |
    | N/A   N/A    P0    N/A /  N/A |   1628MiB /  2687MiB |     99%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |    0     30953    C   ...ofs/cluster/matlab/8.6/bin/glnxa64/MATLAB  1620MiB |

Queues: highio

    pbsubmit -q highio -c "jobHighIO"

To deal with the I/O bottleneck:
- During the job’s lifetime, keep the data in: /cluster
- Create the temporary files in: /cluster/scratch
- Submit multiple jobs with some delay in between, e.g. by interleaving the sleep command between the job submission commands.
- Use the highio queue so there are no more than a total of 20 jobs with high I/O running on Launchpad.
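The advice above to interleave sleep between submissions can be sketched as a small loop. This is an illustration under stated assumptions: the function name submit_staggered, the DRY_RUN switch, and the subj01/subj02 arguments are hypothetical, and jobHighIO stands in for a real job command as it does on the slide. With DRY_RUN=1 (the default here) the pbsubmit lines are only printed, not run.

```shell
#!/bin/sh
# DRY_RUN=1 prints the pbsubmit lines instead of running them (hypothetical switch).
DRY_RUN=${DRY_RUN:-1}

# submit_staggered DELAY CMD...: submit each CMD to the highio queue,
# sleeping DELAY seconds between consecutive submissions.
submit_staggered() {
    delay=$1
    shift
    first=1
    for cmd in "$@"; do
        # Pause between submissions, but not before the first one.
        if [ "$first" -eq 0 ]; then sleep "$delay"; fi
        first=0
        if [ "$DRY_RUN" -eq 1 ]; then
            printf 'pbsubmit -q highio -c "%s"\n' "$cmd"
        else
            pbsubmit -q highio -c "$cmd"
        fi
    done
}

# Placeholder job commands with a 1-second delay; use a longer delay in practice.
submit_staggered 1 "jobHighIO subj01" "jobHighIO subj02"
```

Setting DRY_RUN=0 on Launchpad would run the submissions for real; a suitable delay depends on how long each job spends reading its input.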

Interactive Jobs

    qsub -I -V -X -q p60
    qsub: waiting for job 14783853.launchpad.nmr.mgh.harvard.edu to start
    qsub: job 14783853.launchpad.nmr.mgh.harvard.edu ready

    hostname
    compute-0-6
    ...
    exit

Email Notifications

    pbsubmit -m MartinosID -c "echo Test."

    pbsubmit -m iman -c "echo Started; sleep 30; echo Finished"
    Opening pbsjob_12
    qsub -V -S /bin/sh -m abe -M iman -l nodes=1:ppn=1,vmem=7gb -r n /pbs/iman/pbsjob_12
    14783855.launchpad.nmr.mgh.harvard.edu

Email received when the job started running:
    PBS Job Id: 14783855.launchpad.nmr.mgh.harvard.edu
    Job Name: pbsjob_12
    Exec host: compute-0-37/6
    Begun execution

Email received when the job ended:
    Execution terminated
    Exit_status=0
    resources_used.cput=00:00:00
    resources_used.mem=3516kb
    resources_used.vmem=321640kb
    resources_used.walltime=00:00:30

Using Matlab on Launchpad

Matlab licenses are limited!
- Compile your Matlab code so you can run it without a license:
    - Use the mcc command in Matlab.
    - See JP Coutu’s guide to using Matlab’s deploytool: http://nmr.mgh.harvard.edu/martinos/itgroup/deploytool.html
- Submit the job to run the compiled executable file.

Thank You!

If you use Launchpad in your research, please cite the “NIH Instrumentation Grants 1S10RR023401, 1S10RR019307, and 1S10RR023043” in your publication.