More HTCondor
2014 OSG User School, Monday, Lecture 2
Greg Thain, University of Wisconsin-Madison

Questions so far?

Goals For This Session
‣ Understand the mechanisms of HTCondor (and HTC in general) a bit more deeply
‣ Use a few more HTCondor features
‣ Run more (and more complex) jobs at once

HTCondor in Depth

Why Is HTC Difficult?
‣ The system must track jobs, machines, policy, …
‣ The system must recover gracefully from failures
‣ Try to use all available resources, all the time
‣ Lots of variety in users, machines, networks, …
‣ Sharing is hard (e.g., policy, security)
More about the principles of HTC on Thursday.

Main Parts of HTCondor

Function                         HTCondor Name          How many
Track waiting/running jobs       schedd ("sked-dee")    one per submit machine
Track available machines         collector              one per pool
Match jobs and machines          negotiator             one per pool
Manage one machine               startd ("start-dee")   one per execute machine
Manage one job (on submitter)    shadow                 one per running job
Manage one job (on machine)      starter                one per running job
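You can ask the collector for the ads of each kind of daemon; as a quick way to poke at these parts from any machine in the pool (the -schedd and -negotiator flags are standard condor_status options):

  condor_status              # startd (slot) ads: the execute side
  condor_status -schedd      # schedd ads: the submit side
  condor_status -negotiator  # the negotiator on the central manager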

Typical Architecture

[Architecture diagram] A single Central Manager machine runs the collector and negotiator (e.g., cm.chtc.wisc.edu). One or more submit machines each run a schedd (e.g., osg-ss-submit). Many execute machines each run a startd (e.g., eNNN.chtc.wisc.edu and cNNN.chtc.wisc.edu).

The Life of an HTCondor Job

(Throughout, the schedd on the submit machine and the startd on the execute machine send periodic updates to the collector on the Central Manager.)

1. User submits a job to the schedd
2. Negotiator requests job details from the schedd
3. Schedd sends jobs to the negotiator
4. Negotiator notifies the schedd of a match
5. Schedd claims the matched startd
6. Schedd starts a shadow for the job; the startd starts a starter
7. Shadow transfers the executable and input files to the starter
8. Starter starts the job
9. Starter transfers output back to the shadow
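The job's user log (the file named by log = in the submit file) records several of these steps. An abridged, illustrative excerpt; the event codes (000 = submitted, 001 = executing, 005 = terminated) are standard, but the job IDs, timestamps, and addresses here are made up:

  000 (014.000.000) 07/14 10:00:01 Job submitted from host: <128.104.0.1:9618>
  ...
  001 (014.000.000) 07/14 10:03:12 Job executing on host: <128.105.0.2:9618>
  ...
  005 (014.000.000) 07/14 10:12:47 Job terminated.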

Matchmaking Algorithm (sort of)
A. Gather lists of machines and waiting jobs
B. For each user:
   1. Compute the maximum number of slots to allocate to the user (the user's "fair share", a percentage of the whole pool)
   2. For each job (until the maximum number of matches is reached):
      a. Find all acceptable machines (i.e., both the machine's and the job's requirements are met)
      b. If there are no acceptable machines, skip to the next job
      c. Sort the acceptable machines by the job's preferences
      d. Pick the best one
      e. Record the match of job and slot
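A toy illustration of step (a), with made-up attribute values: matchmaking evaluates each ad's Requirements expression in the context of the other ad, and both must come out true.

  Job ad:      Requirements = (Arch == "X86_64") && (Memory >= 2000)
  Machine ad:  Arch = "X86_64"
               Memory = 4096
               Requirements = (Owner != "banned_user")

  Both Requirements evaluate to true, so this job and this slot can match.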

ClassAds
In HTCondor, information about machines and jobs (and more) is represented by ClassAds. You do not write ClassAds (much), but reading them may help with understanding and debugging. ClassAds can represent persistent facts, current state, preferences, requirements, and more. HTCondor uses a core of predefined attributes, but users can add new attributes of their own, which can be used for matchmaking, reporting, etc.
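To read real ClassAds, dump them with the standard -long option (the job ID and slot name here are placeholders):

  condor_q -long 14.0                           # full ClassAd for job 14.0
  condor_status -long slot1@e001.chtc.wisc.edu  # full ClassAd for one slot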

Sample ClassAd Attributes
MyType = "Job"                                             (string)
TargetType = "Machine"                                     (string)
ClusterId = 14                                             (number)
Owner = "cat"                                              (string)
Cmd = "/.../test-job.py"                                   (string)
Requirements = (Arch == "X86_64") && (OpSys == "LINUX")    (operations/expressions)
Rank = 0.0                                                 (number)
In = "/dev/null"
UserLog = "/.../test-job.log"
Out = "test-job.out"
Err = "test-job.err"
NiceUser = false                                           (boolean)
ShoeSize = 10                                              (arbitrary user-defined attribute)

HTCondor Universes
Different combinations of configurations and features are bundled as universes:
‣ vanilla: a "normal" job; the default, and fine for today
‣ standard: supports checkpointing and remote I/O
‣ java: special support for Java programs
‣ parallel: supports parallel jobs (such as MPI)
‣ grid: submits to a remote system (more tomorrow)
… and more!
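A minimal sketch of a vanilla-universe submit file (the file names are placeholders):

  universe   = vanilla
  executable = test.py
  output     = test.out
  error      = test.err
  log        = test.log
  queue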

HTCondor Priorities
Job priority
‣ Set per job by the user (owner)
‣ Relative to that user's other jobs
‣ Set in the submit file, or changed later with condor_prio
‣ Higher number means run sooner
User priority
‣ Computed based on past usage
‣ Determines the user's "fair share" percentage of slots
‣ Lower number means run sooner (0.5 is the minimum)
Preemption
‣ Low-priority jobs are stopped for high-priority ones (stopped jobs go back into the regular queue)
‣ Governed by the fair-share algorithm and pool policy
‣ Not enabled on all pools
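For example (condor_prio and condor_userprio are standard commands; the job ID is a placeholder):

  condor_prio -p 10 14.0   # raise job 14.0's job priority to 10
  condor_userprio          # show user priorities across the pool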

HTCondor Commands

List Jobs: condor_q
‣ Select jobs: by user (e.g., you), cluster, or job ID
‣ Format the output as you like
‣ View full ClassAd(s), with all attributes (most useful when limited to a single job ID)
‣ Ask HTCondor why a job is not running: it may not explain everything, but it can help; remember that negotiation happens periodically
Explore condor_q options in the next exercises.
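Some common forms (the user name and job IDs are placeholders; the options are standard):

  condor_q alice                 # jobs owned by user alice
  condor_q 14                    # all jobs in cluster 14
  condor_q -long 14.2            # full ClassAd for job 14.2
  condor_q -better-analyze 14.2  # why is job 14.2 not running?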

List Slots: condor_status
‣ Select slots: all slots, slots on one host, or a specific slot
‣ Select slots by ClassAd expression, e.g., slots with SL 6 (the OS) and ≥ 10 GB of memory
‣ Format the output as you like
‣ View full ClassAd(s), with all attributes (most useful when limited to a single slot)
Explore condor_status options in the exercises.
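For example, the SL 6 / ≥ 10 GB query above might look like this (OpSysAndVer and Memory are standard slot attributes; Memory is in MB):

  condor_status -constraint 'OpSysAndVer == "SL6" && Memory >= 10240'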

Submit Files

Resource Requests
request_cpus   = ClassAdExpression
request_disk   = ClassAdExpression
request_memory = ClassAdExpression
‣ Ask for the minimum resources your job needs on the execute machine
‣ Slots may be dynamically allocated to fit the request (very advanced!)
‣ Check the job log for actual usage!!!
request_disk = 2000000     # in KB by default
request_disk = 2GB         # units: KB, MB, GB, TB
request_memory = 2000      # in MB by default
request_memory = 2GB       # units: KB, MB, GB, TB
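When a job ends, its user log includes a resource-usage table, which is what "check the job log" refers to. The exact layout varies by HTCondor version, and the numbers here are illustrative:

  Partitionable Resources :    Usage  Request Allocated
     Cpus                 :                 1         1
     Disk (KB)            :       40  2000000   2500000
     Memory (MB)          :      950     2000      2048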

File Access in HTCondor
Option 1: Shared filesystem
‣ Easy to use (jobs just access files)
‣ But it must exist and be ready to handle the load
should_transfer_files = NO
Option 2: HTCondor transfers files for you
‣ Must name all input files (except the executable)
‣ May name output files; defaults to all new/changed files
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = a.txt, b.tgz
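If you want only certain output files back, you can name them explicitly with the standard transfer_output_files command (the file name here is a placeholder):

  transfer_output_files = results.dat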

Notifications
notification = Always|Complete|Error|Never
When to send:
‣ Always: job checkpoints or completes
‣ Complete: job completes (default)
‣ Error: job completes with an error
‣ Never: do not send
notify_user = <email address>: where to send; if unset, mail goes to the job owner's account on the submit machine

Requirements and Rank
requirements = ClassAdExpression
‣ The expression must evaluate to true for the job to match a slot
‣ HTCondor adds defaults! Check the resulting ClassAds …
‣ See the HTCondor Manual for more
rank = ClassAdExpression
‣ Ranks matching slots in order of preference
‣ Must evaluate to a floating-point number; higher is better
‣ False becomes 0.0, True becomes 1.0
‣ Undefined or error values become 0.0
Writing rank expressions is an art form.
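A sketch of how these look together in a submit file (OpSysAndVer and Memory are standard slot attributes; the values are made up):

  # match only SL 6 slots; among those, prefer slots with more memory
  requirements = (OpSysAndVer == "SL6")
  rank = Memory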

Arbitrary Attributes
+AttributeName = value
Adds arbitrary attribute(s) to the job's ClassAd. Useful in (at least) two cases:
‣ Affect matchmaking with special attributes
‣ Report on jobs with a specific attribute value
Experiment with reporting during the exercises!
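For example (the attribute name and value here are hypothetical), in the submit file:

  +ProjectName = "OSGSchool2014"

and then, to report on just those jobs:

  condor_q -constraint 'ProjectName == "OSGSchool2014"'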

Many Jobs Per Submit File, Pt. 1
‣ Can use the queue statement many times
‣ Make changes between queue statements: change arguments, log, output, input files, …
‣ Whatever is not explicitly changed remains the same (here, the second job still uses log = test.log)
executable = test.py
log = test.log
output = test-1.out
arguments = "test-input.txt 42"
queue
output = test-2.out
arguments = "test-input.txt 43"
queue

Many Jobs Per Submit File, Pt. 2
queue N
‣ Submits N copies of the job
‣ One cluster number for all copies, just as before
‣ Process numbers go from 0 to (N-1)
What good is having N copies of the same job?
‣ Randomized processes (e.g., Monte Carlo)
‣ The job fetches its work description from somewhere else
‣ But what about overwriting output files, etc.?
Wouldn't it be nice to have different files and/or arguments automatically applied to each job?

Separating Files by Run
output = program.out.$(Cluster).$(Process)
‣ Can use these variables anywhere in the submit file; often used in output, error, and log file names
‣ Maybe use $(Process) in arguments? You cannot perform math on the values; your code must accept them as is
output = test.$(Cluster)_$(Process).out
log = test.$(Cluster)_$(Process).log
arguments = "test-input.txt $(Process)"
queue 10

Separating Directories by Run
initialdir = path
‣ Uses path (instead of the submit directory) to locate files, e.g., output, error, log, transfer_input_files
‣ Not executable; it is still relative to the submit directory
‣ Use $(Process) to separate all I/O by job ID
initialdir = run-$(Process)
transfer_input_files = input-$(Process).txt
output = test.$(Cluster)-$(Process).out
log = test.$(Cluster)-$(Process).log
arguments = "input-$(Process).txt $(Process)"
queue 10
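With initialdir = run-$(Process) and queue 10, the run directories and their input files must exist before you submit. A minimal shell sketch; the directory and input names follow the example above, and template-input.txt is a hypothetical source file:

  for i in $(seq 0 9); do
      mkdir -p run-$i
      cp template-input.txt run-$i/input-$i.txt   # hypothetical per-job input
  done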

Your Turn!

Exercises!
‣ Ask questions! Lots of instructors are around.
‣ Reminder: Get your X.509 certificate today!
Coming next:
Now-12:15    Hands-on exercises
12:15-1:15   Lunch
1:15-5:30    Afternoon sessions