Architecture & System Overview

1 Compute Cluster: Architecture & System Overview

2 What is a Computing Cluster?
A cluster is a parallel or distributed processing system consisting of interconnected stand-alone computers working cooperatively as a single, integrated computing resource. The many cores process data simultaneously, or in concert, to reduce overall processing time.

3 Research Applications
Dense or large datasets
Complex systems or processes
Multiple permutations
Simulations

4 Medical Imaging Parallelization

5 CAMH Computing Cluster (SCC)
High-performance compute cluster, Linux-based
32 compute nodes, ~1000 cores
GPU node, ~5000 cores
Over 120 software suites: imaging, genetics, electrophysiology, SER, MATLAB, STATA
High-performance storage: over 400 TB capacity

6 Advantages of the SCC
Computing resources provide more than two orders of magnitude more computing power than current top-end workstations or desktops. Free for all CAMH researchers. Professional support for over 10 years, bridging the "gap" between science and IT & high-performance computing.

7 Shared Resource
The system is a shared resource: what you do impacts others, so responsible usage is critical. The queue system automates resource sharing (Fair Share Policy).

8 Research IT Portal

13 Ganglia – Cluster Monitoring

14 Access Overview
Diagram: users log in via login.scc.camh.net (LOGIN) and transfer files via ftp.scc.camh.net (FTP/NFS). The login node and the development nodes (DEV01, DEV02) submit work to the queue, which dispatches jobs to compute nodes NODE03 through NODE22. SCC storage and research storage are shared across the system.

15 ‘MobaXterm’ - Login

17 Node Roles
Login (LOGIN): simple data management; no long or demanding jobs; can submit jobs to the queue.
Development nodes (DEV01, DEV02): design and test pipelines; mid-range jobs (time/resources); limited cores available; can submit jobs to the queue.
Compute nodes (NODE03 . . . NODE22): parallelize; run batches on 30+ compute nodes; very few limitations; run heavy jobs here!

18 Data Transfer (SCC-FTP)
Use ftp.scc.camh.net (sftp, rsync, scp) for large data transfers to FTP/NFS storage. Compress data BEFORE transfer to save bandwidth; because compression is taxing on the system, do it on your own machine rather than on the SCC.
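The compress-then-transfer workflow above can be sketched as follows. The directory name, file name, and remote path are illustrative only, not SCC conventions:

```shell
# Hypothetical transfer workflow: compress locally, then copy to the SCC.
# 'study_data' and the remote path are made-up names for the sketch.
mkdir -p study_data && echo demo > study_data/scan01.nii   # toy data so the sketch runs
tar -czf study_data.tar.gz study_data/                     # compress BEFORE transfer

# On a real transfer you would then run one of:
#   rsync -av study_data.tar.gz user@ftp.scc.camh.net:/home/user/
#   scp study_data.tar.gz user@ftp.scc.camh.net:/home/user/
# (rsync is preferred for large files, since it can resume interrupted transfers)

tar -tzf study_data.tar.gz                                 # verify the archive contents
```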

19 SCC FileSystem
Storage is a Network File System (NFS) shared across the compute nodes (NODE03 . . . NODE22), exporting /home, /scratch, /imaging, and /genome. As IO operations increase ↑, responsiveness decreases ↓.

20 SCC Queue
The queue is not strictly first-in-first-out (FIFO): the scheduling system monitors usage and sets priorities. For example, User A with ~100 jobs is demoted while User B with ~10 jobs is promoted. Requested resources must exist, otherwise jobs are held.

21 Usage Policies
Do not run jobs on the login node; they will be killed automatically (via limits)
Use development nodes to design & test
Use the queue wherever possible
Be mindful of IO demands
Compress data before transfer to the SCC
Consider impact during data transfer (rsync)

22 Using the Queue
Jobs are submitted via a "submission script". The queue management system interprets it and distributes the jobs to compute nodes (NODE03 . . . NODE32). The script contains, or points to, a 'main script':

#!/bin/bash -l
#PBS -l nodes=1
#PBS -l ppn=1
#PBS -l walltime=1:00:00
#PBS -N test
#PBS -V
cd $PBS_O_WORKDIR
echo "hello" > world.txt

23 Submission Scripts
#!/bin/bash -l                "Using bash"
#PBS -l nodes=1               PBS directives
#PBS -l ppn=1
#PBS -l walltime=1:00:00
#PBS -N test
#PBS -V
cd $PBS_O_WORKDIR             Working directory
echo "hello" > world.txt      Main command
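Putting the script above to work is a two-step process: save it to a file, then hand it to the scheduler. The sketch below writes the same script shown on the slide; `qsub` and `qstat` are the standard Torque/PBS commands (Torque is the SCC's resource manager, per slide 29), and the job ID shown is illustrative:

```shell
# Write the submission script from the slide to a file.
cat > submit.sh <<'EOF'
#!/bin/bash -l
#PBS -l nodes=1
#PBS -l ppn=1
#PBS -l walltime=1:00:00
#PBS -N test
#PBS -V
cd $PBS_O_WORKDIR
echo "hello" > world.txt
EOF

# On the cluster you would then run (shown commented out, since qsub
# only exists on the SCC's login/development nodes):
#   qsub submit.sh        # returns a job ID, e.g. 12345.scc
#   qstat -u "$USER"      # check the job's state in the queue

grep -c '^#PBS' submit.sh   # sanity check: counts the PBS directives
```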

24 Using the Queue: Multi-Tenancy
Diagram: two submission scripts, 'test' and 'test 2', each request nodes=1 and ppn=6. The queue management system places both jobs on the same compute node (NODE03), since together they fit within one node's processors; NODE04 through NODE32 remain free for other work.

25 Using the Queue
Diagram: the same two scripts now request ppn=12 each. A job requesting 12 processors fills a whole node, so the queue management system assigns 'test' to NODE03 and 'test 2' to NODE04.

26 RAM Considerations
Different programs require differing amounts of memory. Example: on NODE03, two processes could consume 90% of the RAM, leaving only 10% for the remaining ten cores. Jobs placed there can fail because they lack resources.
Solution 1: Define the memory you require in the submission script
Solution 2: Request all 12 processors, so no other jobs are scheduled alongside yours
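Solution 1 amounts to one extra PBS directive. In this sketch, `mem=8gb`, the script name, and the workload are illustrative values, not SCC recommendations:

```shell
# Hypothetical example of Solution 1: declare the memory the job needs
# so the scheduler will not co-locate it with jobs that would starve it.
cat > mem_job.sh <<'EOF'
#!/bin/bash -l
#PBS -l nodes=1
#PBS -l ppn=1
#PBS -l mem=8gb
#PBS -l walltime=1:00:00
#PBS -N memtest
cd $PBS_O_WORKDIR
./my_analysis        # placeholder for the real workload
EOF

grep 'mem=' mem_job.sh   # confirm the memory request is present
```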

27 Multi-Tenancy
Multi-tenancy attempts to maximize resource usage by using all processors where available. It requires responsible usage from all parties:
Do not ask for 12 ppn if you only need 1 ppn
Be conscious of your RAM requirements
Users cannot log in to compute nodes directly, as this greatly interferes with job scheduling

28 SCC Queues
The PBS scheduler will select the most appropriate queue for use. 'short' queues have higher priority but have max-walltime / max-node restrictions; 'intq' is a special standing-reservation queue.

Queue        Priority   Walltime range
short1n      300        0:00:01 - 8:00:00
short        250        0:00:01 - 8:00:00
medium1n     200        8:00:01 - 24:00:00
medium       150        8:00:01 - 24:00:00
long1n       100        24:00:01 - 48:00:00
long         50         24:00:01 - 48:00:00
verylong1n   20         48:00:01 +
verylong     10         48:00:01 +
intq         --

'1n' queues are limited to a single node (max node 1); their counterparts may span up to 5 nodes. Max jobs per user: 4.

PBS reserves two nodes during working hours (Mon-Fri, 9am-5pm) for interactive PBS session use; you need to specify the intq queue to use the reserved resources. Since these reserved resources are shared, please restrict yourself to a maximum of two CPUs for testing/debugging your MPI program.

29 Job Scheduling on the Cluster
Batch-system diagram: submitted jobs (JOB X, Y, Z, N, O, U) arrive at the MASTER node, where Torque (the resource manager) and Moab (the scheduler) apply queues (Queue-A, Queue-B, Queue-C), policies, priorities, shares/tickets, resource limits, and user/project settings, then place each job into slots on the compute nodes.

31 Browser Access

32 PBSWeb-Lite

43 Advanced SCC
What will be covered:
Introduction
Text Editors
Executable Scripts
Loops: While, For, Until
Fork – Child Processes
GNU Parallel
Queue Submission
Distributed Job
QBatch – Parallel Made Easy
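As a small preview of the loop and fork topics listed above, a minimal bash sketch (the subject names and log file are illustrative only):

```shell
# Preview of two "Advanced SCC" topics: a for loop, and fork/child
# processes via backgrounded commands collected with wait.
: > processed.txt                                 # start with an empty log
for subject in sub01 sub02 sub03; do
  echo "processing $subject" >> processed.txt &   # fork: each iteration runs as a child
done
wait                                              # block until every child finishes
echo "all done"
```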

44 Thank You!

45 Physical View

