Research CU Boulder
Additional Topics in Research Computing
Shelley Knuth, Peter Ruprecht
Outline
  Brief overview of Janus
  Update on replacement for Janus
  Alternate compute environment: condo
  PetaLibrary storage
  General supercomputing information
  Queues and job submission
  Keeping the system happy
  Using and installing software
  Compiling and submission examples
  Time for questions, feedback, and one-on-one
UCD/AMC 4/3/15
What does Research Computing do?
  We manage
    Shared large-scale compute resources
    Large-scale storage
    High-speed network without firewalls (ScienceDMZ)
    Software and tools
  We provide
    Consulting support for building scientific workflows on the RC platform
    Training
    Data management support in collaboration with the Libraries
Research Computing - Contact
  Main contact
    Will go into our ticket system
  Main web site: http://
  Data management resource: http://data.colorado.edu
  Mailing lists (announcements and collaboration)
  Facebook: https:// puting
  Meetup: http:// Computational-Science-and-Engineering/
JANUS SUPERCOMPUTER AND OTHER COMPUTE RESOURCES
What Is a Supercomputer?
  CU’s supercomputer is one large computer made up of many smaller computers and processors
  Each of these computers is called a node
  Each node has processors/cores
    These carry out the instructions of the computer
  In a supercomputer, all these computers talk to each other through a communications network
    On Janus: InfiniBand
Why Use a Supercomputer?
  Supercomputers let you solve problems that are too complex for a desktop
    On a desktop these might take hours, days, weeks, months, or years
    On a supercomputer they might take only minutes, hours, days, or weeks
  Useful for problems that require large amounts of memory
Computers and Cars - Analogy [figure]
Hardware - Janus Supercomputer
  1368 compute nodes (Dell C6100)
  16,428 total cores
  No battery backup of the compute nodes
  Fully non-blocking QDR InfiniBand network
  960 TB of usable Lustre-based scratch storage
    GB/s max throughput
Different Node Types
  Login nodes
    This is where you are when you log in to login.rc.colorado.edu
    No heavy computation, interactive jobs, or long-running processes
    Script or code editing, minor compiling
    Job submission
  Compile nodes
    Compiling code
    Job submission
  Compute/batch nodes
    This is where jobs that are submitted through the scheduler run
    Intended for heavy computation
Additional Compute Resources
  2 Graphics Processing Unit (GPU) nodes
    Visualization of data
    Exploring GPUs for computing
  4 high-memory nodes (“himem”)
    1 TB of memory, cores per node
  16 blades for long-running jobs (“crestone”)
    2-week wall times allowed
    96 GB of memory (4 times as much as a Janus node)
Replacing Janus
  Janus goes off warranty in July, but we expect to run it (or part of it) for another 9-12 months after that.
  We have applied for a National Science Foundation grant for a replacement cluster.
    About 3x the FLOPS performance of Janus
    Approx. 350 compute nodes, 24 cores, 128 GB RAM
    Approx. 20 nodes with GPU coprocessors
    Approx. 20 nodes with Xeon Phi processors
    Scratch storage suited to a range of applications
  Decision from NSF expected by early Sept.
  Installation would begin early in
“Condo” Cluster
  CU provides infrastructure such as network, data center space, and scratch storage
  Partners buy nodes that fit into the infrastructure
  Partners get priority access on their nodes
  Other condo partners can backfill onto nodes not currently in use
  Node options will be updated with current hardware each year
  A given generation of nodes will run 5 years
  Little or no customization of OS and hardware options
Condo Node Options
  Compute node
    2x 12-core 2.5 GHz Intel “Haswell” processors
    128 GB RAM
    1x 1 TB 7200 RPM hard drive
    10 gigabit/s Ethernet
    Potential uses: general-purpose high-performance computation that doesn’t require low-latency parallel communication between nodes
    Price: $6,875
  GPU node
    2x 12-core 2.5 GHz Intel “Haswell” processors
    128 GB RAM
    1x 1 TB 7200 RPM hard drive
    10 gigabit/s Ethernet
    2x NVIDIA K20M GPU coprocessors
    Potential uses: molecular dynamics, image processing
    Price: $12,509
Node Options - cont’d
  High-memory node
    4x 12-core 3.0 GHz Intel “Ivy Bridge” processors
    1024 GB RAM
    10x 1 TB 7200 RPM hard drives in high-performance RAID configuration
    10 gigabit/s Ethernet
    Potential uses: genetics/genomics or finite-element applications
    Price: $34,511
  Notes:
    Dell is providing bulk discount pricing even for incremental purchases
    5-year warranty
    Network infrastructure in place to support over 100 nodes at 10 Gbps
Condo Scheduling Overview
  All jobs are run through a batch/queue system (Slurm).
  Interactive jobs on compute nodes are allowed, but they must be initiated through the scheduler.
  Each partner group has its own high-priority queue for jobs that will run on nodes that it has purchased.
  High-priority jobs are guaranteed to start running within 24 hours (unless other high-priority jobs are already running) and can run for a maximum wall time of 14 days.
  All partners also have access to a low-priority queue that can run on any condo nodes that are not already in use by their owners.
RMACC conference
  Rocky Mountain Advanced Computing Consortium
  August 11-13, 2015, Wolf Law Building, CU-Boulder campus
  Hands-on and lecture tutorials about HPC, data, and technical topics
  Educational and student-focused panels
  Poster session for students
DATA STORAGE AND TRANSFER
Storage Spaces
  Home directories
    Not high performance; not for direct computational output
    2 GB quota; snapshotted frequently
    /home/user1234
  Project spaces
    Not high performance; can be used to store or share programs, input files, maybe small data files
    250 GB quota; snapshotted regularly
    /projects/user1234
  Lustre parallel scratch filesystem
    No hard quotas; no backups
    Files created more than 180 days in the past may be purged at any time
    /lustre/janus_scratch/user1234
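Because scratch files older than 180 days may be purged, it can help to list at-risk files from time to time. The following is a minimal sketch, not an RC-provided tool: the function name `list_purge_candidates` is hypothetical, it assumes GNU `find`, and it uses modification time as a stand-in for the creation age that the purge policy refers to.

```shell
#!/bin/bash
# List regular files under a directory whose modification time is more than
# a given number of days in the past (default: 180).
# Assumption: modification time approximates the "created" age used by the purge policy.
list_purge_candidates() {
    local dir="$1"
    local days="${2:-180}"
    find "$dir" -type f -mtime +"$days" -print
}

# Example (path pattern taken from the slide; substitute your own username):
# list_purge_candidates /lustre/janus_scratch/user1234
```

Anything it reports that you still need should be copied to a project space or the PetaLibrary, since scratch has no backups.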
Research Data Storage: PetaLibrary
  Long-term, large-scale storage option
    “Active” or “Archive”
  Primary storage system for condo
  Modest charge to use
  Provide expertise and services around this storage
    Data management
    Consulting
PetaLibrary: Cost to Researchers
  Additional free options
    Snapshots: on-disk incremental backups
    Sharing through Globus Online
Keeping Lustre Happy
  Janus Lustre is tuned for large parallel I/O operations.
  Creating, reading, writing, or removing many small files simultaneously can cause performance problems.
  Don’t put more than 10,000 files in a single directory.
  Avoid “ls -l” in a large directory.
  Avoid shell wildcard expansions ( * ) in large directories.
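The guidelines above can be followed with ordinary tools. This is a minimal sketch under stated assumptions: the function names `count_entries` and `bundle_dir` are hypothetical, GNU `find` and `tar` are assumed, and nothing here is specific to the Janus Lustre configuration.

```shell
#!/bin/bash
# Count the entries directly inside a directory without sorting the listing,
# printing per-file metadata (as "ls -l" does), or expanding a shell wildcard.
count_entries() {
    find "$1" -mindepth 1 -maxdepth 1 -printf '.' | wc -c
}

# Bundle a directory of small files into a single tar archive, so the
# filesystem stores one large file instead of many small ones.
bundle_dir() {
    local src="$1" archive="$2"
    tar -cf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
}

# Example (hypothetical paths):
# count_entries /lustre/janus_scratch/user1234/run01
# bundle_dir /lustre/janus_scratch/user1234/run01 /lustre/janus_scratch/user1234/run01.tar
```

Checking `count_entries` against the 10,000-file guideline before a run finishes is cheaper than discovering an oversized directory afterwards.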
Data Sharing and Transfers
  Globus tools: Globus Online and gridftp
    Easier external access without CU-RC account, especially for external collaborators
    Endpoint: colorado#gridftp
  SSH: scp, sftp, rsync
    Adequate for smaller transfers
  Connections to RC resources are limited by UCD bandwidth
ACCESSING AND UTILIZING RC RESOURCES
Initial Steps to Use RC Systems
  Apply for an RC account
    Read Terms and Conditions
    Choose “University of Colorado – Denver”
    Use your UCD email address
  Get a One-Time Password device
  Apply for a computing allocation
    Initially you will be added to UCD’s existing umbrella allocation
    Subsequently you will need a brief proposal for additional time
Logging In
  SSH to login.rc.colorado.edu
    Use your RC username and One-Time Password
  From a login node, you can SSH to a compile node, janus-compile1 (or 2, 3, 4), to build your programs
  All RC nodes use Red Hat Enterprise Linux 6
Job Scheduling
  On a supercomputer, jobs are scheduled rather than just run instantly at the command line
  Several different scheduling software packages are in common use
    Licensing, performance, and functionality issues have caused us to choose Slurm
  SLURM = Simple Linux Utility for Resource Management
    Open source
    Increasingly popular at other sites
  More at
Running Jobs
  Do NOT compute on login or compile nodes
  Interactive jobs
    Work interactively at the command line of a compute node
    Commands:
      salloc --qos=janus-debug
      sinteractive --qos=janus-debug
  Batch jobs
    Submit a job that will be executed when resources are available
    Create a text file containing information about the job
    Submit the job file to a queue:
      sbatch --qos= file
Quality of Service = Queues
  janus-debug
    Only debugging, no production work
    Maximum wall time (the maximum amount of time your job is allowed to run) of 1 hour; 2 jobs per user
  janus
    Default
    Maximum wall time of 24 hours
  janus-long
    Maximum wall time of 168 hours; limited number of nodes
  himem
  crestone
  gpu
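The wall-time limits above suggest a simple rule for choosing a QOS for production work. The sketch below is illustrative only: `pick_qos` is a hypothetical helper, it covers only the janus and janus-long limits stated on the slide, and it ignores the himem, crestone, and gpu QOS values, whose limits are not listed here.

```shell
#!/bin/bash
# Suggest a QOS for a production job from its requested wall time in whole hours.
# Limits come from the slide: janus <= 24 h, janus-long <= 168 h.
# janus-debug is never suggested because it is reserved for debugging.
pick_qos() {
    local hours="$1"
    if ! [[ "$hours" =~ ^[0-9]+$ ]] || [ "$hours" -lt 1 ]; then
        echo "wall time must be a positive whole number of hours" >&2
        return 1
    elif [ "$hours" -le 24 ]; then
        echo "janus"
    elif [ "$hours" -le 168 ]; then
        echo "janus-long"
    else
        echo "$hours hours exceeds the 168-hour janus-long limit" >&2
        return 1
    fi
}

# Example: sbatch --qos "$(pick_qos 36)" slurmSub.sh
```

Because janus-long has a limited number of nodes, splitting a long run into restartable pieces of 24 hours or less may start sooner than one long job.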
Parallel Computation
  Special software is required to run a calculation across more than one processor core or more than one node
  Sometimes the compiler can automatically parallelize an algorithm, and frequently special libraries and functions can be used to abstract the parallelism
  OpenMP
    For multiprocessing across cores in a single node
    Most compilers can auto-parallelize with OpenMP
  MPI (Message Passing Interface)
    Can parallelize across multiple nodes
    Libraries/functions compiled into executable
    Several implementations (OpenMPI, mvapich, mpich, …)
    Usually requires a fast interconnect (like InfiniBand)
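For OpenMP programs, the number of threads used within a node is normally controlled by the standard `OMP_NUM_THREADS` environment variable. The following is a sketch under assumptions: `set_omp_threads` is a hypothetical helper, and it relies on Slurm setting `SLURM_CPUS_PER_TASK`, which happens only when the job requests `--cpus-per-task`.

```shell
#!/bin/bash
# Choose the OpenMP thread count for the current job.
# SLURM_CPUS_PER_TASK is set by Slurm only when --cpus-per-task was requested;
# fall back to a single thread otherwise.
set_omp_threads() {
    export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
}

# Example use inside a batch script, before launching the program
# (./my_openmp_program is a placeholder name):
# set_omp_threads
# ./my_openmp_program
```

MPI programs do not need this setting; their process count comes from the node and task counts requested from the scheduler.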
Software
  Common software is available to all users via the “module” system
  To find out what software is available, type
    module avail
  To set up your environment to use a software package, type
    module load /
  You can install your own software
    But you are responsible for support
    We are happy to assist
EXAMPLES
Login and Modules example
  Log in:
    ssh
    ssh janus-compile2
  List and load modules:
    module list
    module avail
    module load openmpi/openmpi-1.8.0_intel
    module load slurm
  Compile parallel program:
    mpicc -o hello hello.c
Login and Modules example
  Log in:
    ssh
  List and load modules:
    module list
    module avail matlab
    module load matlab/matlab-2014b
    module load slurm
Submit Batch Job example
  Batch script (slurmSub.sh):
    #!/bin/bash
    #SBATCH -N 2                      # Number of requested nodes
    #SBATCH --ntasks-per-node=12      # Number of cores per node
    #SBATCH --time=1:00:00            # Max wall time
    #SBATCH --job-name=SLURMDemo      # Job submission name
    #SBATCH --output=SLURMDemo.out    # Output file name
    ###SBATCH -A                      # Allocation
    ###SBATCH --mail-type=end         # Send email on completion
    ###SBATCH --mail-user=            # Email address
    mpirun ./hello
  Submit the job:
    sbatch --qos janus-debug slurmSub.sh
  Check job status:
    squeue -q janus-debug
    cat SLURMDemo.out
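When many similar jobs are needed, the batch script can be generated rather than edited by hand. This is a minimal sketch, not part of the RC tooling: `make_job_script` is a hypothetical helper that emits only the `#SBATCH` options shown on the slide, and submission itself is still done with `sbatch`.

```shell
#!/bin/bash
# Print a Slurm batch script to stdout.
# Arguments: nodes, tasks per node, wall time (H:MM:SS), job name, command...
make_job_script() {
    local nodes="$1" tasks="$2" walltime="$3" name="$4"
    shift 4
    cat <<EOF
#!/bin/bash
#SBATCH -N ${nodes}
#SBATCH --ntasks-per-node=${tasks}
#SBATCH --time=${walltime}
#SBATCH --job-name=${name}
#SBATCH --output=${name}.out

$*
EOF
}

# Example: regenerate the script from the slide, then submit it by hand with
#   sbatch --qos janus-debug slurmSub.sh
# make_job_script 2 12 1:00:00 SLURMDemo mpirun ./hello > slurmSub.sh
```

Running `bash -n slurmSub.sh` on the generated file checks its shell syntax before it is submitted.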
Batch Job example
  Contents of scripts
    matlab_tutorial_general.sh
      Wrapper script loads the Slurm commands
      Changes to the appropriate directory
      Calls the MATLAB .m files
    matlab_tutorial_general_code.m
  Run the MATLAB program as a batch job:
    sbatch matlab_tutorial_general.sh
  Check job status:
    squeue -q janus-debug
    cat SERIAL.out
Example to work on
  Take the wrapper script matlab_tutorial_general.sh
  Alter it to do the following:
    Rename your output file to whatever you’d like
    Email yourself the output; do it at the beginning of the job run
    Run on 12 cores
    Run on 2 nodes
    Rename your job to whatever you’d like
  Submit the batch script
TRAINING, CONSULTING AND PARTNERSHIPS
Training
  Weekly tutorials on computational science and engineering topics
  Meetup group: Colorado-Computational-Science-and-Engineering (http:// Colorado-Computational-Science-and-Engineering)
  All materials are online
  Various boot camps/tutorials
Consulting
  Support in building software
  Workflow efficiency
  Parallel performance debugging and profiling
  Data management in collaboration with the Libraries
    See Tobin Magle
  Getting started with XSEDE
Thank you!
  Questions?
  Slides available at