Southgreen HPC system 2014

Concepts
Cluster : a compute farm, i.e. a collection of compute servers that can be shared and accessed through a single "portal" (marmadais.cirad.fr)
Job manager (SGE) : software that allows you to initiate and/or send jobs to the cluster compute servers (also known as compute hosts or nodes)

Architecture

Useful links
Available software : er.xls
Software installation form :

Do not run jobs on marmadais
Jobs or programs found running on marmadais could be killed. Submit batch jobs (qsub) or, if really needed, work interactively on a cluster node (qrsh or qlogin).

Compute node characteristics
CPU : x86_64
cc-adm-calcul-1, …, cc-adm-calcul-23 : 8 cores, 32 GB RAM
cc-adm-calcul-24, …, cc-adm-calcul-26 : 8 cores, 92 GB RAM
cc-adm-calcul-27 : 32 cores, 1 TB RAM

Types of jobs/programs
Sequential : uses only 1 core.
Multi-thread : uses multiple CPU cores via program threads on a single cluster node.
Parallel (MPI) : uses many CPU cores on many cluster nodes through a Message Passing Interface (MPI) environment.

File systems
home2 : GPFS (on slow disks)
Work : GPFS (on fast disks)
All the rest : NFS

Step by step
Preparation : data, script
Submit
Job computation
Checkout

Using the cluster
All computing on the cluster is done by logging into marmadais.cirad.fr and submitting batch or interactive jobs.
Job submission : qsub, qrsh, qsh/qlogin
Job management : qmod, qdel, qhold, qrls, qalter
Cluster information display : qstat, qhost, qconf
Accounting : qacct
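As a quick illustration, a typical session might chain these commands together; the script name and job-ID below are placeholders, not values from the slides :
qsub myscript.sh       # submit a batch script; qsub prints the assigned job-ID
qstat -u $USER         # check the state of your own jobs
qhost                  # show the cluster nodes and their load
qacct -j <job-ID>      # after the job finishes, look up its resource usage in the accounting records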

Interactive jobs
qrsh establishes a remote shell connection (ssh) on a cluster node which meets your requirements.
qlogin (bigmem.q) establishes a remote shell connection (ssh) with X display export on a cluster node which meets your requirements.
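For example, an interactive session can be opened with an explicit memory request (the values shown are the ones used later in these slides and are only illustrative) :
qrsh -l mem_free=8G,h_vmem=10G    # interactive shell on a node with at least 8 GB of free memory
qlogin                            # interactive shell with X display export (runs in bigmem.q)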

A first look at batch job submission
Create a shell script :
vi script1.sh    # create a shell script
The shell script :
hostname             # show the name of the cluster node this job is running on
ps -ef               # see what is running on the cluster node
sleep 30             # sleep for 30 seconds
echo -e "\n Done"
Submit the shell script with a given job name (note the use of the -N option) :
qsub -N mytest script1.sh
Check the status of your job(s) :
qstat
The output (stdout and stderr) goes to files in your home directory :
ls mytest.o* mytest.e*
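Put together, script1.sh is just a plain shell script along these lines (the shebang line is an assumption, it is not shown on the slide) :
#!/bin/bash
# script1.sh - minimal test job
hostname             # show the name of the cluster node this job is running on
ps -ef               # see what is running on the cluster node
sleep 30             # sleep for 30 seconds
echo -e "\n Done"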

A few options
-cwd : run the job in the current working directory (where qsub was executed) rather than the default (home directory)
-o path/filename, -e path/filename : send the standard output (error) stream to a different file
-j y : merge the standard error stream into the standard output
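Combining these options, a submission might look like the following; the log path and job name are placeholders :
qsub -cwd -j y -o logs/mytest.log -N mytest script1.sh    # run in the submission directory, merge stderr into stdout, write it to logs/mytest.log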

Memory limitations
Resources are limited and shared, so specify your resource requirements.
Ask for the amount of memory that you expect your job will need (mem_free), so that it runs on a cluster node with sufficient memory available, AND place a limit on the amount of memory your job can use (h_vmem), so that you don't accidentally crash a compute node and take down other users' jobs along with yours :
qsub -cwd -l mem_free=6G,h_vmem=8G myscript.sh
qrsh -l mem_free=8G,h_vmem=10G
If these are not set, default values are applied for you.
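Resource requests can also be embedded in the script itself using SGE's #$ directive lines, so they travel with the job; this is a minimal sketch, the values and job name are illustrative :
#!/bin/bash
#$ -N mytest                   # job name
#$ -cwd                        # run in the directory where qsub was executed
#$ -l mem_free=6G,h_vmem=8G    # request 6 GB of free memory, cap virtual memory at 8 GB
hostname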

Simple (sequential) job
Template : /usr/local/bioinfo/template/job_sequentiel.sh

Multi-thread job
Template : /usr/local/bioinfo/template/job_multithread.sh

MPI job
Template : /usr/local/bioinfo/template/job_parallel.sh
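The template contents are not reproduced on the slides; as a rough, hypothetical sketch (the parallel environment name, slot count and program name are assumptions), an MPI batch script under SGE could look like this :
#!/bin/bash
#$ -N my_mpi_job        # job name
#$ -cwd                 # run in the submission directory
#$ -pe parallel_8 16    # request 16 slots in the parallel_8 environment (hypothetical choice)
# $NSLOTS is set by SGE to the number of slots actually granted
mpirun -np $NSLOTS ./my_mpi_program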

Parallel environments
parallel_2
parallel_4
parallel_8
parallel_fill
parallel_rr
parallel_smp
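A parallel environment is selected with the -pe option of qsub. For instance, a multi-threaded job confined to one node might request slots in parallel_smp; its name suggests a single-node (SMP) environment, but that is an assumption, as are the slot count and script name :
qsub -cwd -pe parallel_smp 8 my_threaded_job.sh    # ask for 8 slots on one node for a multi-threaded program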

How many jobs can I submit ?
As many as you want … BUT only a limited number of "slots" will run at any one time. The rest will sit in the queue-wait state (qw) and will run as your other jobs finish. In SGE, a slot generally corresponds to a single CPU core. The maximum number of slots per user may change depending on the availability of cluster resources or special needs and requests.

Queues
normal.q : 32 GB RAM nodes. Access : external users. Runtime : maximum 48 hours. Normal priority.
urgent.q : any node. Access : limited. Runtime : maximum 24 hours. Highest priority.
hypermem.q : 1 TB RAM node. Access : everybody. Runtime : no limit. Normal priority.
bigmem.q : 92 GB RAM nodes. Access : everybody. Runtime : no limit. Normal priority.
bioinfo.q : 32 GB RAM nodes. Access : internal users. Runtime : maximum 48 hours. Normal priority.
long.q : for long-running jobs (32 GB RAM nodes). Access : everybody. Runtime : no limit. Low priority.
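A job can be directed to a specific queue with the -q option of qsub; the queue choice, memory values and script name below are just an example :
qsub -cwd -q bigmem.q -l mem_free=48G,h_vmem=60G myscript.sh    # submit to the 92 GB nodes with an explicit memory request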

qstat
Info for a given user :
qstat -u username (or just qstat or qu for your own jobs)
Full dump of queue and job status :
qstat -f
What do the column labels mean ?
job-ID : a unique identifier for the job
name : the name of the job
state : the state of the job
  r : running
  s : suspended
  t : being transferred to an execution host
  qw : queued and waiting to run
  Eqw : an error occurred with the job
Why is my job in Eqw state ?
qstat -j job-ID -explain E

Visualisation

qdel
To terminate a job, first get the job-ID :
qstat (or qu)
Terminate the job :
qdel job-ID
Forced termination of a running job (admins only) :
qdel -f job-ID