High Throughput Computing with Condor at Notre Dame
Douglas Thain
30 April 2009

Today's Talk
High Level Introduction (20 min)
– What is Condor?
– How does it work?
– What is it good for?
Hands-On Tutorial (30 min)
– Finding Resources
– Submitting Jobs
– Managing Jobs
– Ideas for Scaling Up

The Cooperative Computing Lab
– We create software that enables the reliable sharing of cycles and storage capacity between cooperating people.
– We conduct research on the effectiveness of various systems and strategies for large scale computing.
– We collaborate with others that need to use large scale computing, so as to find the real problems and make an impact on the world.
– We operate systems like Condor that directly support research and collaboration at ND.

What is Condor?
– Condor is software from UW-Madison that harnesses idle cycles from existing machines. (Most workstations are ~90% idle!)
– With the assistance of CSE, OIT, and CRC staff, Condor has been installed on ~700 cores in Engineering and Science.
– The Condor pool expands the capabilities of researchers to perform both cycle- and storage-intensive research.
– New users and contributors are welcome to join!

Condor Distributed Batch System (~700 cores)
[Diagram: dedicated research clusters (ccl 8x1, cclsun 16x2, loco 32x2, sc0 32x2, netscale 16x2 and 1x32, cvrl 32x2, iss 44x2, compbio 1x8) serving MPI, Hadoop, biometrics, storage, and network research; personal workstations in Fitzpatrick (130), CSE (170), CHEG (25), EE (10), Nieuwland (20), and DeBartolo (10); timeshared collaboration machines, www portals, login nodes, and a db server; all tied to a central manager, with "flocking" to other Condor pools at Purdue (~10k cores) and Wisconsin (~5k cores).]

The Condor Principle
Machine Owners Have Absolute Control
– They set who, what, and when may use the machine.
– They can kick jobs off at any time manually.
Default policy that satisfies most people:
– Start a job if the console has been idle > 15 minutes.
– Suspend the job if the console is used or the CPU is busy.
– Kick off the job if it has been suspended > 10 minutes.
– After that, jobs run in this order: owner, research group, Notre Dame, elsewhere.
For the full technical details, see:
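As a rough illustration, a policy like the one above is expressed with ClassAd expressions in the machine's Condor configuration. The sketch below uses standard machine attributes, but the thresholds and load test are illustrative assumptions, not Notre Dame's actual settings:

# Sketch of an owner policy in condor_config (illustrative values only)
MINUTE  = 60                          # normally already defined by Condor
# Start a job only after the keyboard has been idle 15 minutes
START   = (KeyboardIdle > 15 * $(MINUTE))
# Suspend the job if the owner returns or non-Condor CPU load appears
SUSPEND = (KeyboardIdle < $(MINUTE)) || ((LoadAvg - CondorLoadAvg) > 0.5)
# Evict the job if it has been suspended for more than 10 minutes
PREEMPT = (Activity == "Suspended") && \
          ((CurrentTime - EnteredCurrentActivity) > 10 * $(MINUTE))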

What's the value proposition?
If you install Condor on your workstations, servers, or clusters, then:
– You retain immediate, preemptive priority on your machines, both batch and interactive.
– You gain access to the unused cycles available on other machines.
– By the way, other people get to use your machines when you are not using them.

Condor Architecture
[Diagram: a schedd represents a user with jobs to run ("I want an INTEL CPU with > 3 GB RAM"); a startd represents an available machine ("I prefer to run jobs owned by user 'joe'"); the match maker pairs them ("You two should talk to each other"), and the schedd then runs the job on the startd, transferring its files X and Y.]
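A very small sketch of the ClassAds behind that picture; the attribute values are illustrative, not taken from the ND pool:

# Job ClassAd (held by the schedd) -- illustrative values
Owner        = "joe"
Cmd          = "simple.sh"
Requirements = (Arch == "INTEL") && (Memory > 3072)

# Machine ClassAd (advertised by the startd) -- illustrative values
Arch         = "INTEL"
OpSys        = "LINUX"
Memory       = 4096
Rank         = (Owner == "joe")

# The match maker pairs a job ad with a machine ad when each side's
# Requirements evaluate to true against the other ad; Rank expresses
# which of the acceptable matches each side prefers.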

~700 CPUs at Notre Dame
[Diagram: many startds and schedds across campus, all reporting to a single match maker.]

Flocking to Other Sites
[Diagram: 700 CPUs at Notre Dame flocking to 2000 CPUs at the University of Wisconsin and 20,000 CPUs at Purdue University.]

What is Condor Good For?
Condor works well on large workflows of sequential jobs, provided that they match the machines available to you.
Ideal workload:
– One million jobs that require one hour each.
Doesn't work at all:
– An 8-node MPI job that must run now.
Many workloads can be converted into the ideal form, with varying degrees of effort.

High Throughput Computing
Condor is not High Performance Computing:
– HPC: run one program as fast as possible.
Condor is High Throughput Computing:
– HTC: run as many programs as possible before my paper deadline on May 1st.

Intermission and Questions

Getting Started
If your shell is tcsh:
% setenv PATH /afs/nd.edu/user37/condor/software/bin:$PATH
If your shell is bash:
% export PATH=/afs/nd.edu/user37/condor/software/bin:$PATH
Then, create a temporary working space:
% mkdir /tmp/YOURNAME
% cd /tmp/YOURNAME
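A quick sanity check that the tools are now on your PATH (a sketch; the exact version output will differ):

% which condor_status
% condor_version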

Viewing Available Resources
Condor Status Web Page:
Command Line Tool:
– condor_status
– condor_status -constraint '(Memory>2048)'
– condor_status -constraint '(Arch=="INTEL")'
– condor_status -constraint '(OpSys=="LINUX")'
– condor_status -run
– condor_status -submitters
– condor_status -pool boilergrid.rcac.purdue.edu
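Constraints can be combined with the usual ClassAd operators; for example, a sketch built only from the attributes shown above:

% condor_status -constraint '(OpSys=="LINUX") && (Arch=="X86_64") && (Memory>2048)'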

A Simple Script Job

#!/bin/sh
# echo the arguments, then print the date and machine information
echo $@
date
uname -a

% vi simple.sh
% chmod 755 simple.sh
% ./simple.sh hello world

A Simple Submit File
% vi simple.submit

universe = vanilla
executable = simple.sh
arguments = hello condor
output = simple.stdout
error = simple.stderr
should_transfer_files = yes
when_to_transfer_output = on_exit
log = simple.logfile
queue

Submitting and Watching a Job
Submit the job:
– condor_submit simple.submit
Look at the job queue:
– condor_q
Remove a job:
– condor_rm
See where the job went:
– tail -f simple.logfile

Submitting Lots of Jobs
% vi simple.submit

universe = vanilla
executable = simple.sh
arguments = hello $(PROCESS)
output = simple.stdout.$(PROCESS)
error = simple.stderr.$(PROCESS)
should_transfer_files = yes
when_to_transfer_output = on_exit
log = simple.logfile
queue 50
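With queue 50, Condor substitutes $(PROCESS) with 0 through 49, so each job gets distinct arguments and output files. The same macro works in other submit-file lines too; a sketch, assuming hypothetical input files input.0 through input.49 exist in the submit directory:

transfer_input_files = input.$(PROCESS)
arguments = input.$(PROCESS)
queue 50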

What Happened to All My Jobs?
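When jobs sit idle or disappear from the queue, condor_q can explain why they are not matching or were held; a sketch, using a hypothetical cluster id of 42:

% condor_q -analyze 42
% condor_q -hold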

Setting Requirements
By default, Condor will only run your job on a machine with the same CPU and OS as the submitter.
Use requirements to send your job to other kinds of machines:
– requirements = (Memory>2048)
– requirements = (Arch=="INTEL" || Arch=="X86_64")
– requirements = (MachineGroup=="fitzlab")
– requirements = (UidDomain!="nd.edu")
(Hint: Try out your requirements expressions using condor_status as above.)
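Requirement clauses combine with && and ||; for example, a sketch of a single expression asking for a 64-bit Linux machine with more than 2 GB of memory outside nd.edu:

requirements = (Arch=="X86_64") && (OpSys=="LINUX") && (Memory>2048) && (UidDomain!="nd.edu")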

Setting Rank
By default, Condor will assume any machine that satisfies your requirements is sufficient.
Use the rank expression to indicate which machines you prefer:
– rank = (Memory>1024)
– rank = (MachineGroup=="fitzlab")
– rank = (Arch=="INTEL")*10 + (Arch=="X86_64")*20

File Transfer
Notes to keep in mind:
– Condor cannot write to AFS. (It has no AFS credentials.)
– Not all machines in Condor have AFS.
So, you must specify what files your job needs, and Condor will send them there:
– transfer_input_files = x.dat, y.calib, z.library
By default, all files created by your job will be sent home automatically.
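Putting that together with the earlier submit file (a sketch; the data file names here are hypothetical):

universe = vanilla
executable = simple.sh
arguments = x.dat
transfer_input_files = x.dat, y.calib
should_transfer_files = yes
when_to_transfer_output = on_exit
output = simple.stdout
error = simple.stderr
log = simple.logfile
queue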

In Class Assignment
Execute 50 jobs, each running on a machine not at Notre Dame that has more than 1 GB of RAM.
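One possible solution, sketched entirely from the pieces above (adjust the executable and file names to your own job):

universe = vanilla
executable = simple.sh
arguments = hello $(PROCESS)
requirements = (UidDomain!="nd.edu") && (Memory>1024)
output = simple.stdout.$(PROCESS)
error = simple.stderr.$(PROCESS)
should_transfer_files = yes
when_to_transfer_output = on_exit
log = simple.logfile
queue 50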