Introduction to HTC. 2015 OSG User School, Monday, Lecture 1. Greg Thain, University of Wisconsin–Madison, Center For High Throughput Computing.

Presentation transcript:

Introduction to HTC
2015 OSG User School, Monday, Lecture 1
Greg Thain, University of Wisconsin–Madison, Center For High Throughput Computing

Welcome!

Why Are We Here?

Transform Your Research With Computing

Overview

Overview of Week
Monday
‣ High Throughput Computing locally
‣ Miscellaneous: survey, UW reimbursement form
Tuesday
‣ Distributed High Throughput Computing
‣ Security
‣ Tour of Wisconsin Institutes for Discovery
Wednesday
‣ Distributed storage
‣ Practical issues with DHTC
Thursday
‣ From science to production
‣ Principles of HTC
‣ HTC Showcase
‣ Next steps

Overview of a Day
‣ Short introductory lectures
‣ Lots of hands-on exercises
‣ Some demos, interactive sessions, etc.
Optional evening sessions
‣ Monday – Wednesday, 7–9 p.m.
‣ Union South (check TITU)
‣ School staff on hand

Keys to Success
‣ Work hard
‣ Ask questions!
  … during lectures, … during exercises, … during breaks, … during meals
  In person is best; … is OK
‣ If we do not know an answer, we will try to find the person who does

Ready?

One Thing

Computing is Cheap!

Goals For This Session
‣ Understand the basics of High Throughput Computing
‣ Understand a few things about HTCondor, which is one kind of HTC system
‣ Use basic HTCondor commands
‣ Run a job locally!

Why HTC?

Computing in Science
Science: Theory, Experiments

Computing in Science
Science: Theory, Experiments, Computing

Example Challenge
‣ You have a program to run (simulation, Monte Carlo, data analysis, image analysis, stats, …)
‣ Each run takes about 1 hour
‣ You want to run the program 8 × 12 × 100 times
‣ 9,600 hours ≈ 1.1 years … running nonstop!
‣ Conference is next week
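
Spelled out, the arithmetic behind that estimate is:
  8 × 12 × 100 = 9,600 runs
  9,600 runs × 1 hour per run = 9,600 hours
  9,600 hours ÷ 24 = 400 days ≈ 1.1 years of nonstop computing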

Distributed Computing
‣ Use many computers to perform 1 computation
‣ Example:
  2 computers => 4,800 hours ≈ ½ year
  8 computers => 1,200 hours ≈ 2 months
  100 computers => 96 hours = 4 days
  9,600 computers => 1 hour! (but …)
‣ These computers are no faster than your laptop!

Performance vs. Throughput
High Performance Computing (HPC)
‣ Focus on biggest, fastest systems (supercomputers)
‣ Maximize operations per second
‣ Often requires special code
‣ Often must request and wait for access
High Throughput Computing (HTC)
‣ Focus on using all resources, reliably, all the time
‣ Maximize operations per year
‣ Use any kind of computer, even old, slow ones
‣ Must break task into separate, independent parts
‣ Access varies by availability, usage, etc.

HPC vs HTC: An Analogy

Example HTC Site (Wisconsin)
Our local HTC systems' recent CPU hours:
‣ ~280,000 / day
‣ ~8.3 million / month
‣ ~78 million / year

Open Science Grid
‣ HTC scaled way up
‣ Over 110 sites: mostly U.S., some others
‣ Past year:
  ~170 million jobs
  ~770 million CPU hours
  ~372 petabytes transferred
‣ Can submit jobs locally, move to OSG

Other Distributed Computing
Other systems to manage a local cluster:
‣ PBS/Torque
‣ LSF
‣ Sun Grid Engine/Oracle Grid Engine
‣ SLURM
Other wide-area systems:
‣ European Grid Infrastructure
‣ Other national and regional grids
‣ Commercial cloud systems used to augment grids
HPC
‣ Various supercomputers (e.g., TOP500 list)
‣ XSEDE

HTCondor

HTCondor History and Status
History
‣ Started in 1988 as a “cycle scavenger”
‣ Protected interests of users and machine owners
Today
‣ Expanded to become CHTC team: 20+ full-time staff
‣ Current production release: HTCondor …
‣ HTCondor software: ~700,000 lines of C/C++ code
Miron Livny
‣ Professor, UW–Madison CompSci
‣ Director, CHTC
‣ Dir. of Core Comp. Tech., WID/MIR
‣ Tech. Director & PI, OSG

HTCondor Functions
Users
‣ Define jobs, their requirements, and preferences
‣ Submit and cancel jobs
‣ Check on the state of a job
‣ Check on the state of the machines
Administrators
‣ Configure and control the HTCondor system
‣ Declare policies on machine use, pool use, etc.
Internally
‣ Match jobs to machines (enforcing all policies)
‣ Track and manage machines
‣ Run jobs

Terminology: Job
Job: A computer program or one run of it
‣ Not interactive, no GUI (e.g., not Word or …)
  (How could you interact with 1,000 programs running at once?)
1. Input: command-line arguments and/or files
2. Run: do stuff
3. Output: standard output & error and/or files
Scheduling
‣ User decides when to submit job to be run
‣ System decides when to run job, based on policy

Terminology: Machine, Slot
Machine
‣ A machine is a physical computer (typically)
‣ May have multiple processors (computer chips)
‣ One processor may have multiple cores (CPUs)
HTCondor: Slot
‣ One assignable unit of a machine (i.e., 1 job per slot)
‣ Most often, corresponds to one core
‣ Thus, typical machines today have 4–40 slots
‣ Advanced HTCondor feature: can get 1 slot with many cores on 1 machine, for MPI(-like) jobs

Terminology: Matchmaking
Two-way process of finding a slot for a job
‣ Jobs have requirements and preferences
  E.g.: I need Red Hat Linux 6 and 100 GB of disk space, and prefer to get as much memory as possible
‣ Machines have requirements and preferences
  E.g.: I run jobs only from users in the Comp. Sci. dept., and prefer to run ones that ask for a lot of memory
‣ Important jobs may replace less important ones
Thus: not as simple as waiting in a line!
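
On the job side, those requirements and preferences are written as ClassAd expressions in the submit file. A minimal sketch of the example job above might look like this (the attribute names, in particular OpSysAndVer, are assumptions to check against the machine ads in your own pool; Disk is in KiB and Memory in MB):

# require Red Hat 6 and at least ~100 GB of scratch disk (attribute names assumed)
requirements = (OpSysAndVer == "RedHat6") && (Disk >= 100 * 1024 * 1024)
# prefer slots that advertise more memory
rank = Memory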

Running a Job

Viewing Slots: condor_status
‣ With no arguments, lists all slots currently in pool
‣ Summary info is printed at the end of the list
‣ For more info: exercises, -h, manual, next lecture
[Sample output on the slide: one line per slot, e.g. "LINUX X86_64 Claimed Busy … 0:09:32" and "LINUX X86_64 Unclaimed Idle … 0:55:15", followed by a summary table of Total / Owner / Claimed / Unclaimed / Matched / Preempting / Backfill counts per platform (INTEL/WINNT, X86_64/LINUX) and overall.]
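
Once the plain command works, a few standard condor_status options are worth trying (the flags below are documented options, but the exact output depends on your pool; the slot name is a placeholder):

# show only slots that are currently unclaimed
condor_status -avail
# show only slots matching a ClassAd constraint
condor_status -constraint 'OpSys == "LINUX"'
# print the full ClassAd for one slot
condor_status -long <slot-name>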

Viewing Jobs: condor_q
‣ With no args, lists all jobs waiting or running here
‣ For more info: exercises, -h, manual, next lecture

condor_q

-- Submitter: osg-ss-submit.chtc.wisc.edu : … : …
 ID   OWNER  SUBMITTED   RUN_TIME  ST  PRI  SIZE  CMD
 6.0  cat    11/12 09:…  …:00:00   I   …    …    explore.py
 6.1  cat    11/12 09:…  …:00:00   I   …    …    explore.py
 6.2  cat    11/12 09:…  …:00:00   I   …    …    explore.py
 6.3  cat    11/12 09:…  …:00:00   I   …    …    explore.py
 6.4  cat    11/12 09:…  …:00:00   I   …    …    explore.py
5 jobs; 5 idle, 0 running, 0 held
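
A few common condor_q variations (standard options; exact behavior varies a bit across HTCondor versions, and the names and IDs below are placeholders):

# only the jobs belonging to one user
condor_q <username>
# the full ClassAd for a single job
condor_q -long <cluster>.<process>
# ask why a job is still idle / not matching
condor_q -analyze <cluster>.<process>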

Basic Submit File

executable = word_freq.py
universe = vanilla
arguments = "words.txt 1000"

output = word_freq.out
error = word_freq.err
log = word_freq.log

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = words.txt

queue

Basic Submit File (annotations)
‣ arguments: command-line arguments to pass to the executable when run; surround with double quotes [optional]
‣ output, error: local files that will receive the contents of standard output and standard error from the run [optional]
‣ log: HTCondor’s log file from running the job; very helpful, do not omit!
‣ transfer_input_files: comma-separated list of input files to transfer to the machine [optional]

Submit a Job: condor_submit

condor_submit submit-file

Submitting job(s).
1 job(s) submitted to cluster NNN.

‣ Submits job to local submit machine
‣ Use condor_q to track
‣ Each condor_submit creates one cluster
‣ A job ID is written as cluster.process (e.g., 8.0)
‣ We will see how to make multiple processes later
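
Putting the pieces together, a first session might look like the sketch below (the submit-file name word_freq.sub and the cluster number 8 are illustrative only; the output lines follow the pattern shown on the slide):

# submit the job described in the submit file above
condor_submit word_freq.sub
#   Submitting job(s).
#   1 job(s) submitted to cluster 8.

# watch the job go from idle (I) to running (R) and then leave the queue
condor_q

# afterwards, inspect the program's output and HTCondor's job log
cat word_freq.out
cat word_freq.log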

Remove a Job: condor_rm

condor_rm cluster [...]
condor_rm cluster.process [...]

Cluster NNN has been marked for removal.

‣ Removes one or more jobs from the queue
‣ Identify jobs by whole cluster or single job ID
‣ Only you (or an admin) can remove your jobs
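
For example, using the job ID format from the previous slide (cluster 8 is illustrative):

# remove a single job: process 0 of cluster 8
condor_rm 8.0
# remove every job in cluster 8
condor_rm 8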

Your Turn!

Thoughts on Exercises
‣ Copy-and-paste is quick, but you may learn more by typing out commands yourself
‣ Experiment! Try your own variations on the exercises
‣ If you have time, try to apply to your own work
‣ If you do not finish, that’s OK — you can make up work later or during evenings, if you like
‣ If you finish early, try any extra challenges or optional sections, or move ahead to the next section if you are brave

Sometime Today …
Sometime today, do the exercise on getting an X.509 personal certificate
‣ It is not required today
‣ It will be required tomorrow afternoon
‣ It is best to start the process early

Exercises!
‣ Ask questions!
‣ Lots of instructors around
Coming next:
‣ Now – 10:30: Hands-on exercises
‣ 10:30 – 10:45: Break
‣ 10:45 – 11:15: Lecture
‣ 11:15 – 12:15: Hands-on exercises