Architecture & System Overview

Compute Cluster Architecture & System Overview

What is a Computing Cluster? A cluster is a type of parallel or distributed processing system consisting of a collection of interconnected stand-alone computers that work cooperatively as a single, integrated computing resource. Its many cores process data simultaneously, or in concert, to reduce overall processing time.

Research Applications
- Dense or large datasets
- Complex systems or processes
- Multiple permutations
- Simulations

Medical Imaging Parallelization

CAMH Computing Cluster (SCC)
- High-performance compute cluster, Linux-based
- 32 compute nodes, ~1,000 cores
- GPU node, ~5,000 cores
- Over 120 software suites: imaging, genetics, electrophysiology, SER, MATLAB, STATA
- High-performance storage: over 400 TB capacity

Advantages of the SCC
- Computing resources provide more than two orders of magnitude more computing power than current top-end workstations or desktops
- Free for all CAMH researchers
- Professional support, built over 10+ years, bridges the "gap" between science, IT, and high-performance computing

Shared Resource
- The system is a shared resource: what you do impacts others
- Responsible usage is critical
- The queue system automates resource sharing (Fair Share Policy)

Research IT Portal http://ResearchIT.camh.ca

Ganglia – Cluster Monitoring

Access Overview (diagram): users reach the SCC through login.scc.camh.net (LOGIN) for interactive work and ftp.scc.camh.net (FTP/NFS) for data transfer; from the login node, the queue feeds the development nodes (DEV01, DEV02) and the compute nodes (NODE03 … NODE22), all connected to SCC and research storage.

‘MobaXterm’ - Login
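On Windows, MobaXterm provides the SSH session; from macOS or Linux, any standard SSH client works. A minimal sketch (the username jsmith is a placeholder for your SCC account):

  # Connect to the SCC login node
  ssh jsmith@login.scc.camh.net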

Login, Develop, Parallelize
- Login (LOGIN): simple data management; no long or demanding jobs; can submit jobs to the queue
- Development nodes (DEV01, DEV02): design and test pipelines; mid-range jobs (time/resources); limited cores available; can submit jobs to the queue
- Compute nodes (NODE03 … NODE22): run batches on 30+ compute nodes; very few limitations; run heavy jobs here!

Data Transfer (SCC-FTP)
- Use ftp.scc.camh.net (FTP/NFS to storage) for large data transfers, via sftp, rsync, or scp
- Compress data BEFORE transfer to save bandwidth
- Compression is taxing on the system, so compress on your own machine rather than on the SCC
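A sketch of this workflow, assuming a local dataset directory mydata/ (the directory, archive name, username, and destination path are all placeholders):

  # Compress locally before sending, to save bandwidth
  tar -czf mydata.tar.gz mydata/

  # Transfer to the SCC FTP node (rsync can resume interrupted transfers)
  rsync -av --progress mydata.tar.gz jsmith@ftp.scc.camh.net:/home/jsmith/

  # Alternatively, scp works for one-off copies
  scp mydata.tar.gz jsmith@ftp.scc.camh.net:/home/jsmith/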

SCC Filesystem
- Storage areas (/home, /scratch, /imaging, /genome) are served over the Network File System (NFS) and shared across all nodes (NODE03 … NODE22)
- As IO operations increase ↑, responsiveness decreases ↓
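A sketch of IO-friendly habits on a shared NFS filesystem (the per-user /scratch layout is an assumption; check local conventions):

  # See how much space your home directory consumes before staging more data
  du -sh /home/$USER

  # Keep large intermediate files on /scratch rather than /home
  mkdir -p /scratch/$USER/tmp
  export TMPDIR=/scratch/$USER/tmp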

SCC Queue
- The queue is not strictly first in, first out (FIFO): the scheduling system monitors usage and sets priorities
- Example: if User A has ~100 jobs queued and User B has ~10, User A is demoted and User B is promoted
- Requested resources must exist, otherwise jobs are held

Usage Policies
- Do not run jobs on the login node; they will be killed automatically (via limits)
- Use development nodes to design & test
- Use the queue wherever possible
- Be mindful of IO demands
- Compress data before transfer to the SCC
- Consider impact during data transfer (rsync)

Using the Queue
- Jobs are submitted via a "submission script"; the queue management system interprets it and distributes the job to the compute nodes (NODE03 … NODE32)
- The script contains, or points to, a 'main script':

  #!/bin/bash -l
  #PBS -l nodes=1
  #PBS -l ppn=1
  #PBS -l walltime=1:00:00
  #PBS -N test
  #PBS -V
  cd $PBS_O_WORKDIR
  echo "hello" > world.txt

Submission Scripts: anatomy of the example above

  #!/bin/bash -l              # "Using bash": the shell that runs the script
  #PBS -l nodes=1             # PBS directives: requested resources,
  #PBS -l ppn=1               #   job name, and environment
  #PBS -l walltime=1:00:00
  #PBS -N test
  #PBS -V
  cd $PBS_O_WORKDIR           # working directory: where the job was submitted from
  echo "hello" > world.txt    # main command
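A sketch of submitting and monitoring this script, assuming it is saved as test.pbs (the filename is a placeholder; qsub, qstat, and qdel are the standard Torque/PBS commands):

  # Submit the script to the queue; qsub prints the job ID
  qsub test.pbs

  # Check the state of your queued and running jobs
  qstat -u $USER

  # Remove a job from the queue if needed (use the ID printed by qsub)
  qdel 12345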

Using the Queue: Multi-Tenancy
Two jobs that each request nodes=1 and ppn=6 (for example, jobs named 'test' and 'test2', otherwise identical to the script above) can be placed by the queue management system onto the same compute node, since together they fit within a single node.

Using the Queue (continued)
If each of those jobs instead requests nodes=1 and ppn=12, each occupies an entire node, so the queue management system schedules them onto separate nodes (e.g., NODE03 and NODE04).

RAM Considerations
Different programs require differing amounts of memory. For example, on a 12-core node, two processes could take 90% of the RAM, leaving only 10% for the remaining 10 processes; in this case jobs can fail because they lack resources.
- Solution 1: define the memory you require in the submission script
- Solution 2: request all 12 processors, so that no other jobs can land on the node
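A sketch of Solution 1, requesting memory explicitly (the mem directive is standard in Torque/PBS, but the exact limits accepted on the SCC are an assumption; 8gb and the program name are placeholders):

  #!/bin/bash -l
  #PBS -l nodes=1
  #PBS -l ppn=1
  #PBS -l mem=8gb              # request 8 GB of RAM for this job
  #PBS -l walltime=1:00:00
  #PBS -N memtest
  #PBS -V
  cd $PBS_O_WORKDIR
  ./my_analysis                # placeholder for the actual workload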

Multi-Tenancy
- Attempts to maximize resource usage: use all processors where available
- Requires responsible usage from all parties
- Do not ask for 12 ppn if you only need 1 ppn
- Be conscious of your RAM requirements
- Users cannot log in to compute nodes directly, as this greatly interferes with job scheduling

SCC Queues
The PBS scheduler will select the most appropriate queue for each job. 'short' queues have higher priority but carry max-walltime / max-node restrictions; 'intq' is a special standing-reservation queue.

  Queue        Priority   Max nodes   Min walltime   Max walltime
  short1n      300        1           0:00:01        8:00:00
  short        250        5           0:00:01        8:00:00
  medium1n     200        1           8:00:01        24:00:00
  medium       150        5           8:00:01        24:00:00
  long1n       100        1           24:00:01       48:00:00
  long         50         5           24:00:01       48:00:00
  verylong1n   20         1           48:00:01       --
  verylong     10         5           48:00:01       --
  intq         --         --          --             --

Per-user limits also apply: maximum cores per user of 300-400, 200-300, or 100-200 depending on queue tier, and a maximum of 4 jobs per user on some queues.

For intq, PBS reserves two nodes during working hours (Mon-Fri, 9 am-5 pm) for interactive PBS sessions. You must specify this queue to use the reserved resources. Since these reserved resources are shared, please restrict yourself to a maximum of two CPUs for testing/debugging your MPI program.
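A sketch of starting an interactive session on the reserved intq nodes (qsub -I is the standard Torque flag for interactive jobs; the exact resource string accepted on the SCC is an assumption):

  # Request an interactive session with 2 CPUs on the intq queue
  qsub -I -q intq -l nodes=1:ppn=2,walltime=1:00:00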

Job Scheduling on the Cluster (diagram): batch systems in overview. Jobs (JOB X, Y, Z, …) are submitted into queues (Queue-A, Queue-B, Queue-C) on the MASTER node, where Torque (the resource manager) and Moab (the scheduler) apply queues, policies, priorities, shares/tickets, resources, and user/project settings to place job slots onto the compute nodes.

Browser Access http://login.scc.camh.net/pbs/

PBSWeb-Lite http://login.scc.camh.net/pbs/

http://info2.camh.net/scc/index.php

Advanced SCC: what will be covered
- Introduction
- Text editors
- Executable scripts
- Loops: while, for, until
- Fork / child processes
- GNU Parallel
- Queue submission
- Distributed jobs
- QBatch: parallel made easy

Thank You!

Physical View