Additional Topics in Research Computing
Shelley Knuth, Peter Ruprecht
Research CU Boulder

Outline
- Brief overview of Janus
- Update on replacement for Janus
- Alternate compute environment: condo
- PetaLibrary storage
- General supercomputing information
- Queues and job submission
- Keeping the system happy
- Using and installing software
- Compiling and submission examples
- Time for questions, feedback, and one-on-one
UCD/AMC 4/3/15

What does Research Computing do?
We manage
- Shared large scale compute resources
- Large scale storage
- High-speed network without firewalls: ScienceDMZ
- Software and tools
We provide
- Consulting support for building scientific workflows on the RC platform
- Training
- Data management support in collaboration with the Libraries

Research Computing - Contact
- Main contact (will go into our ticket system)
- Main web site: http://
- Data management resource: http://data.colorado.edu
- Mailing lists (announcements and collaboration)
- Facebook: https:// puting
- Meetup: http:// Computational-Science-and-Engineering/

JANUS SUPERCOMPUTER AND OTHER COMPUTE RESOURCES

What Is a Supercomputer?
- CU’s supercomputer is one large computer made up of many smaller computers and processors
- Each different computer is called a node
- Each node has processors/cores
  - These carry out the instructions of the computer
- With a supercomputer, all these different computers talk to each other through a communications network
  - On Janus: InfiniBand

Why Use a Supercomputer?
- Supercomputers give you the opportunity to solve problems that are too complex for the desktop
  - On a desktop, these might take hours, days, weeks, months, or years
  - On a supercomputer, they might take only minutes, hours, days, or weeks
- Useful for problems that require large amounts of memory

Computers and Cars - Analogy [figure]

Hardware - Janus Supercomputer
- 1368 compute nodes (Dell C6100)
- 16,416 total cores (1368 nodes x 12 cores)
- No battery backup of the compute nodes
- Fully non-blocking QDR InfiniBand network
- 960 TB of usable Lustre-based scratch storage
  - GB/s max throughput

Different Node Types
- Login nodes
  - This is where you are when you log in to login.rc.colorado.edu
  - No heavy computation, interactive jobs, or long running processes
  - Script or code editing, minor compiling
  - Job submission
- Compile nodes
  - Compiling code
  - Job submission
- Compute/batch nodes
  - This is where jobs that are submitted through the scheduler run
  - Intended for heavy computation

Additional Compute Resources
- 2 Graphics Processing Unit (GPU) nodes
  - Visualization of data
  - Exploring GPUs for computing
- 4 high-memory nodes (“himem”)
  - 1 TB of memory, cores per node
- 16 blades for long running jobs (“crestone”)
  - 2-week walltimes allowed
  - 96 GB of memory (4 times more than a Janus node)

Replacing Janus
- Janus goes off warranty in July, but we expect to run it (or part of it) for another 9-12 months after that.
- We have applied for a National Science Foundation grant for a replacement cluster.
  - About 3x the FLOPS performance of Janus
  - Approx. 350 compute nodes, 24 cores, 128 GB RAM
  - Approx. 20 nodes with GPU coprocessors
  - Approx. 20 nodes with Xeon Phi processors
  - Scratch storage suited to a range of applications
- Decision from NSF expected by early Sept.
- Installation would begin early in

“Condo” Cluster
- CU provides infrastructure such as network, data center space, and scratch storage
- Partners buy nodes that fit into the infrastructure
- Partners get priority access on their nodes
- Other condo partners can backfill onto nodes not currently in use
- Node options will be updated with current hardware each year
- A given generation of nodes will run 5 years
- Little or no customization of OS and hardware options

Condo Node Options
Compute node:
- 2x 12-core 2.5 GHz Intel “Haswell” processors
- 128 GB RAM
- 1x 1 TB 7200 RPM hard drive
- 10 gigabit/s Ethernet
- Potential uses: general purpose high-performance computation that doesn’t require low-latency parallel communication between nodes
- $6,875
GPU node:
- 2x 12-core 2.5 GHz Intel “Haswell” processors
- 128 GB RAM
- 1x 1 TB 7200 RPM hard drive
- 10 gigabit/s Ethernet
- 2x NVIDIA K20M GPU coprocessors
- Potential uses: molecular dynamics, image processing
- $12,509

Node Options (cont’d)
High-memory node:
- 4x 12-core 3.0 GHz Intel “Ivy Bridge” processors
- 1024 GB RAM
- 10x 1 TB 7200 RPM hard drives in high-performance RAID configuration
- 10 gigabit/s Ethernet
- Potential uses: genetics/genomics or finite-element applications
- $34,511
Notes:
- Dell is providing bulk discount pricing even for incremental purchases
- 5-year warranty
- Network infrastructure in place to support over 100 nodes at 10 Gbps

Condo Scheduling Overview
- All jobs are run through a batch/queue system (Slurm).
- Interactive jobs on compute nodes are allowed, but these must be initiated through the scheduler.
- Each partner group has its own high-priority queue for jobs that will run on nodes that it has purchased.
- High-priority jobs are guaranteed to start running within 24 hours (unless other high-priority jobs are already running) and can run for a maximum wall time of 14 days.
- All partners also have access to a low-priority queue that can run on any condo nodes that are not already in use by their owners.

RMACC conference
- Rocky Mountain Advanced Computing Consortium
- August 11-13, 2015, Wolf Law Building, CU-Boulder campus
- Hands-on and lecture tutorials about HPC, data, and technical topics
- Educational and student-focused panels
- Poster session for students

DATA STORAGE AND TRANSFER

Storage Spaces
- Home directories
  - Not high performance; not for direct computational output
  - 2 GB quota; snapshotted frequently
  - /home/user1234
- Project spaces
  - Not high performance; can be used to store or share programs, input files, maybe small data files
  - 250 GB quota; snapshotted regularly
  - /projects/user1234
- Lustre parallel scratch filesystem
  - No hard quotas; no backups
  - Files created more than 180 days in the past may be purged at any time
  - /lustre/janus_scratch/user1234

Research Data Storage: PetaLibrary
- Long term, large scale storage option
  - “Active” or “Archive”
- Primary storage system for condo
- Modest charge to use
- Provide expertise and services around this storage
  - Data management
  - Consulting

PetaLibrary: Cost to Researchers
- Additional free options
  - Snapshots: on-disk incremental backups
  - Sharing through Globus Online

Keeping Lustre Happy
- Janus Lustre is tuned for large parallel I/O operations.
- Creating, reading, writing, or removing many small files simultaneously can cause performance problems.
- Don’t put more than 10,000 files in a single directory.
- Avoid “ls -l” in a large directory.
- Avoid shell wildcard expansions ( * ) in large directories.
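The directory-size guideline above can be checked without running "ls -l" or expanding a wildcard. The following is a minimal sketch, assuming GNU find (for -printf); the function names count_entries and check_dir are hypothetical, and the 10,000 default is the figure from the slide. find generally needs less per-file metadata than a long listing, though the exact cost on Lustre is not something the slides state.

```shell
#!/bin/bash
# Hypothetical helper: count the entries directly inside a directory
# without a long listing and without shell wildcard expansion.
count_entries() {
    local dir="$1"
    # -mindepth/-maxdepth 1 restricts output to the directory's own entries;
    # printing one dot per entry keeps the count right even for file names
    # that contain newlines.
    find "$dir" -mindepth 1 -maxdepth 1 -printf '.' | wc -c
}

# Hypothetical helper: warn when a directory exceeds the file-count guideline.
check_dir() {
    local dir="$1" limit="${2:-10000}" n
    n=$(count_entries "$dir")
    if [ "$n" -gt "$limit" ]; then
        echo "WARNING: $dir holds $n entries (limit $limit)"
        return 1
    fi
    echo "OK: $dir holds $n entries"
}
```

A call such as check_dir /lustre/janus_scratch/user1234/run01 (a placeholder path) reports whether the directory is within the guideline; splitting output across subdirectories is one way to stay under it.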

Data Sharing and Transfers
- Globus tools: Globus Online and gridftp
  - Easier external access without CU-RC account, especially for external collaborators
  - Endpoint: colorado#gridftp
- SSH: scp, sftp, rsync
  - Adequate for smaller transfers
- Connections to RC resources are limited by UCD bandwidth
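For the SSH-based route, a small rsync transfer might look like the sketch below. It only prints the command so it can be inspected first. The helper name build_rsync_cmd, the user name, and the source directory are placeholders, and sending the copy through login.rc.colorado.edu into Lustre scratch is an assumption rather than something the slides prescribe.

```shell
#!/bin/bash
# Hypothetical helper: print an rsync command that copies a local directory
# into the user's Lustre scratch space. Host name and scratch prefix come
# from the slides; everything else is a placeholder.
# Note: paths containing spaces would need extra quoting.
build_rsync_cmd() {
    local src="$1" user="$2"
    # -a keeps permissions and timestamps, -v lists files, -z compresses in transit.
    # Without a trailing slash on the source, rsync copies the directory itself.
    printf 'rsync -avz %s %s@login.rc.colorado.edu:/lustre/janus_scratch/%s/\n' \
        "$src" "$user" "$user"
}

build_rsync_cmd ./results user1234
```

Running the printed command would still prompt for the RC password, and larger transfers are probably better served by the Globus tools listed above.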

ACCESSING AND UTILIZING RC RESOURCES

Initial Steps to Use RC Systems
- Apply for an RC account
  - Read Terms and Conditions
  - Choose “University of Colorado – Denver”
  - Use your UCD email address
- Get a One-Time Password device
- Apply for a computing allocation
  - Initially you will be added to UCD’s existing umbrella allocation
  - Subsequently you will need a brief proposal for additional time

Logging In
- SSH to login.rc.colorado.edu
  - Use your RC username and One-Time Password
- From a login node, you can SSH to a compile node (janus-compile1, 2, 3, or 4) to build your programs
- All RC nodes use Red Hat Enterprise Linux 6

Job Scheduling
- On a supercomputer, jobs are scheduled rather than just run instantly at the command line
- Several different scheduling software packages are in common use
  - Licensing, performance, and functionality issues have caused us to choose Slurm
- SLURM = Simple Linux Utility for Resource Management
  - Open source
  - Increasingly popular at other sites
- More at

Running Jobs
- Do NOT compute on login or compile nodes
- Interactive jobs
  - Work interactively at the command line of a compute node
  - Commands:
      salloc --qos=janus-debug
      sinteractive --qos=janus-debug
- Batch jobs
  - Submit a job that will be executed when resources are available
  - Create a text file containing information about the job
  - Submit the job file to a queue:
      sbatch --qos= file
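The batch workflow above (create a job file, then submit it) can be sketched as follows. This is an illustration only: the file name demo_job.sh, its contents, and the five-minute limit are hypothetical, and the sketch submits to janus-debug simply because that is the QoS the slides use in their examples.

```shell
#!/bin/bash
# Write a minimal job file. The file name and job contents are hypothetical.
job_file="demo_job.sh"
cat > "$job_file" <<'EOF'
#!/bin/bash
#SBATCH -N 1                    # one node
#SBATCH --time=0:05:00          # five-minute wall time limit
#SBATCH --output=demo_job.out   # file that collects the job's output
hostname                        # runs on the compute node, not the login node
EOF

# Submit only where Slurm is installed; elsewhere just report what would run.
if command -v sbatch >/dev/null 2>&1; then
    sbatch --qos=janus-debug "$job_file"
else
    echo "sbatch not found; would run: sbatch --qos=janus-debug $job_file"
fi
```

Keeping the submission behind the command -v check lets the same script be tried on a machine without Slurm, where it only writes the job file.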

Quality of Service = Queues
- janus-debug
  - Only debugging; no production work
  - Maximum wall time 1 hour, 2 jobs per user
    (wall time is the maximum amount of time your job is allowed to run)
- janus
  - Default
  - Maximum wall time of 24 hours
- janus-long
  - Maximum wall time of 168 hours; limited number of nodes
- himem
- crestone
- gpu
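The wall time limits above suggest a simple rule for choosing a QoS. The sketch below encodes it; the function name pick_qos and its "debug" argument are hypothetical, and the himem, crestone, and gpu queues are left out because the slide gives no limits for them.

```shell
#!/bin/bash
# Hypothetical helper: map a requested wall time (whole hours) to a QoS name,
# using the limits listed on the slide (1 h, 24 h, 168 h).
# Assumption: janus-debug is for debugging only, so it is never chosen
# automatically; pass "debug" as the second argument to request it.
pick_qos() {
    local hours="$1" purpose="${2:-production}"
    if [ "$purpose" = "debug" ] && [ "$hours" -le 1 ]; then
        echo "janus-debug"
    elif [ "$hours" -le 24 ]; then
        echo "janus"
    elif [ "$hours" -le 168 ]; then
        echo "janus-long"
    else
        echo "error: $hours h exceeds the 168 h janus-long limit" >&2
        return 1
    fi
}
```

For example, pick_qos 36 prints janus-long, and the result could be passed on as sbatch --qos="$(pick_qos 36)" with a job file.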

Parallel Computation
- Special software is required to run a calculation across more than one processor core or more than one node
- Sometimes the compiler can automatically parallelize an algorithm, and frequently special libraries and functions can be used to abstract the parallelism
- OpenMP
  - For multiprocessing across cores in a single node
  - Most compilers can auto-parallelize with OpenMP
- MPI (Message Passing Interface)
  - Can parallelize across multiple nodes
  - Libraries/functions compiled into executable
  - Several implementations (OpenMPI, mvapich, mpich, …)
  - Usually requires a fast interconnect (like InfiniBand)

Software
- Common software is available to all users via the “module” system
- To find out what software is available, you can type
    module avail
- To set up your environment to use a software package, type
    module load /
- You can install your own software
  - But you are responsible for support
  - We are happy to assist
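In scripts it can help to confirm that the module command exists before relying on it, since it is usually a shell function set up at login. The wrapper below is a sketch under that assumption; load_if_available is a hypothetical name, and the two module names are the ones that appear in the slides' own examples.

```shell
#!/bin/bash
# Hypothetical wrapper: load a module only if the "module" command is
# defined in this shell; otherwise say so and carry on.
load_if_available() {
    local name="$1"
    if ! type module >/dev/null 2>&1; then
        echo "module command not available; skipping $name"
        return 0
    fi
    module load "$name" && echo "loaded $name"
}

load_if_available slurm
load_if_available openmpi/openmpi-1.8.0_intel
```

On a machine without the module system the script prints the two "skipping" messages and continues, which makes it easier to test job scripts off the cluster.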

EXAMPLES

Login and Modules example
- Log in:
    ssh
    ssh janus-compile2
- List and load modules:
    module list
    module avail
    module load openmpi/openmpi-1.8.0_intel
    module load slurm
- Compile parallel program:
    mpicc -o hello hello.c

Login and Modules example (MATLAB)
- Log in:
    ssh
- List and load modules:
    module list
    module avail matlab
    module load matlab/matlab-2014b
    module load slurm

Submit Batch Job example
Batch script:
    #!/bin/bash
    #SBATCH -N 2                      # Number of requested nodes
    #SBATCH --ntasks-per-node=12      # Number of cores per node
    #SBATCH --time=1:00:00            # Max walltime
    #SBATCH --job-name=SLURMDemo      # Job submission name
    #SBATCH --output=SLURMDemo.out    # Output file name
    ###SBATCH -A                      # Allocation
    ###SBATCH --mail-type=end         # Send email on completion
    ###SBATCH --mail-user=            # Email address
    mpirun ./hello
Submit the job:
    sbatch --qos janus-debug slurmSub.sh
Check job status:
    squeue -q janus-debug
    cat SLURMDemo.out
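The batch script requests 2 nodes with 12 tasks each, so an MPI launcher that fills the allocation would be expected to start 2 x 12 = 24 ranks. The sketch below computes that figure inside or outside a job. SLURM_JOB_NUM_NODES and SLURM_NTASKS_PER_NODE are environment variables Slurm sets for a job; the helper name total_tasks and the fallback values (taken from the script above) are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical helper: number of tasks implied by the node and per-node
# task counts. Inside a job Slurm provides the two variables; outside a job
# the defaults mirror the example script (2 nodes, 12 tasks per node).
total_tasks() {
    local nodes="${SLURM_JOB_NUM_NODES:-2}"
    local per_node="${SLURM_NTASKS_PER_NODE:-12}"
    echo $(( nodes * per_node ))
}

echo "expected MPI ranks: $(total_tasks)"
```

Comparing this number with the count of lines the hello program prints to SLURMDemo.out is a quick sanity check that the job used the whole allocation, assuming each rank prints one line.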

Batch Job example
- Contents of scripts
  - matlab_tutorial_general.sh
    - Wrapper script loads the slurm commands
    - Changes to the appropriate directory
    - Calls the MATLAB .m files
  - Matlab_tutorial_general_code.m
- Run the MATLAB program as a batch job:
    sbatch matlab_tutorial_general.sh
- Check job status:
    squeue -q janus-debug
    cat SERIAL.out

Example to work on
- Take the wrapper script matlab_tutorial_general.sh
- Alter it to do the following:
  - Rename your output file to whatever you’d like
  - Email yourself the output; do it at the beginning of the job run
  - Run on 12 cores
  - Run on 2 nodes
  - Rename your job to whatever you’d like
- Submit the batch script

TRAINING, CONSULTING AND PARTNERSHIPS

Training
- Weekly tutorials on computational science and engineering topics
  - Meetup group: Colorado-Computational-Science-and-Engineering (http:// Colorado-Computational-Science-and-Engineering)
- All materials are online
- Various boot camps/tutorials

Consulting
- Support in building software
- Workflow efficiency
- Parallel performance debugging and profiling
- Data management in collaboration with the Libraries
  - See Tobin Magle
- Getting started with XSEDE

Thank you! Questions?
Slides available at