An Introduction to High-Throughput Computing Monday morning, 9:15am Alain Roy OSG Software Coordinator University of Wisconsin-Madison.


An Introduction to High-Throughput Computing Monday morning, 9:15am Alain Roy OSG Software Coordinator University of Wisconsin-Madison

OSG Summer School 2011 Who Am I? With Condor since 2001 Open Science Grid Software Coordinator Taught at six previous summer schools I have the two cutest kids in the whole world: 2

OSG Summer School 2011 A dinner follow-up: Snow in Wisconsin 3

OSG Summer School 2011 Overview of day Lectures alternating with exercises  Emphasis on lots of exercises  Hopefully overcome PowerPoint fatigue & help you understand better 4

OSG Summer School 2011 Some thoughts on the exercises It’s okay to move ahead on exercises if you have time It’s okay to take longer on them if you need to If you move along quickly, try the “On Your Own” sections and “Challenges” We’ll have an optional evening session if you want more time, and I’ll be there to give help 5

OSG Summer School 2011 Most important! Please ask me questions! …during the lectures …during the exercises …during the breaks …during the meals …over dinner …the rest of the week If I don’t know, I’ll find the right person to answer your question. 6

OSG Summer School 2011 Before we start Sometime today, do the exercise on getting a certificate.  It is required for all exercises Tuesday through Thursday  It will be easiest if you do it today 7

OSG Summer School 2011 Goals for this session Understand basics of high-throughput computing Understand the basics of Condor Run a basic Condor job 8

OSG Summer School 2011 What is high-throughput computing? (HTC) An approach to distributed computing that focuses on long-term throughput, not instantaneous computing power.  We don’t care about operations per second  We care about operations per year Implications:  Focus on reliability  Use all available resources (not just high-end supercomputers)  That slow four-year-old cluster down the hall? Include it! 9

OSG Summer School 2011 Good uses of HTC “I need as many simulation results as possible before my deadline…” “I have lots of small-ish tasks that can be run independently” 10

OSG Summer School 2011 What’s not HTC? The need for “real-time” results:  (Ignore the fact that “real-time” is hard to define.)  HTC is less worried about latency (delay to answer) than total throughput The need to maximize FLOPS  “I must use a supercomputer, because I need the fastest computer/network/storage/…” 11

OSG Summer School 2011 An example problem: BLAST A scientist has:  Question: Does a protein sequence occur in other organisms?  Data: lots of protein sequences from various organisms  Parameters: how to search the database. More throughput means  More protein sequences queried  Larger/more protein databases examined  More parameter variation We’ll try out BLAST later today 12
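
As a sketch of how such a sweep might be expressed as a Condor submit file (the wrapper script, database, and file names here are made up, not the actual setup of today's exercise):

  # One BLAST query per job; $(Process) runs from 0 to 99
  universe    = vanilla
  executable  = run_blast.sh
  arguments   = -query seq_$(Process).fasta -db proteins.db -out result_$(Process).txt
  transfer_input_files = seq_$(Process).fasta, proteins.db
  output      = blast_$(Process).out
  error       = blast_$(Process).err
  log         = blast.log
  queue 100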

OSG Summer School 2011 Why is HTC hard? The HTC system has to keep track of:  Individual tasks (a.k.a. jobs) & their inputs  Computers that are available The system has to recover from failures  There will be failures! Distributed computers mean more chances for failures. You have to share computers  Sharing can be within an organization, or between organizations  So you have to worry about security.  And you have to worry about policies on how you share. If you use a lot of computers, you have to deal with variety:  Different kinds of computers (arch, OS, speed, etc.)  Different kinds of storage (access methodology, size, speed, etc.)  Different networks interacting (network problems are hard to debug!) 13

OSG Summer School 2011 Let’s take one step at a time Can you run one job on one computer? Can you run one job on another local computer? Can you run 10 jobs on a set of local computers? Can you run 1 job on a remote computer? Can you run 10 jobs at a remote site? Can you run a mix of jobs here and remotely? This is the (rough) path we’ll take in the school this week (moving from small to large, and from local to distributed) 14

OSG Summer School 2011 Discussion For 5 minutes, talk to a neighbor: If you want to run one job in a local cluster of computers: 1) What do you (the user) need to provide so a single job can be run? 2) What does the system need to provide so your single job can be run?  Think of this as a set of processes: what needs to happen when the job is given to the system? A “process” could be a computer process, or just an abstract task. 15

OSG Summer School 2011 Alain’s answer: What does the user provide? A “headless job”.  Not interactive/no GUI: how could you interact with 1000 simultaneous jobs? A set of input files A set of output files A set of parameters (command-line arguments) Requirements:  Ex: My job requires at least 2GB of RAM.  Ex: My job requires Linux. Control/Policy:  Ex: Notify me when the job is done.  Ex: Job 2 is more important than Job 1.  Ex: Kill my job if it’s run for more than 6 hours. 16
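
In Condor, these user-side pieces all go into a submit description file. A minimal sketch, with made-up executable and file names and illustrative policy values:

  # A headless program: no GUI, no interaction
  universe     = vanilla
  executable   = analysis.exe
  # Command-line parameters and input/output files
  arguments    = -n 1000
  transfer_input_files = input.dat
  output       = analysis.out
  error        = analysis.err
  log          = analysis.log
  # Requirements: at least 2GB of RAM, and Linux
  request_memory = 2048
  requirements = (OpSys == "LINUX")
  # Policy: tell me when the job finishes, and remove it if it runs over 6 hours
  notification = Complete
  periodic_remove = (JobStatus == 2) && ((time() - EnteredCurrentStatus) > (6 * 3600))
  queue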

OSG Summer School 2011 Alain’s answer: What does the system provide? Methods to:  Submit/Cancel job.  Check on state of job.  Check on state of available computers. Processes to:  Reliably track set of submitted jobs.  Reliably track set of available computers.  Decide which job runs on which computer.  Manage a single computer.  Start up a single job. 17

OSG Summer School 2011 Surprise! Condor does this (and more) Methods to:  Submit/cancel a job: condor_submit / condor_rm  Check on the state of a job: condor_q  Check on the state of available computers: condor_status Processes to:  Reliably track the set of submitted jobs: schedd  Reliably track the set of available computers: collector  Decide which job runs where: negotiator  Manage a single computer: startd  Start up a single job: starter 18
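
A typical command-line session might look roughly like this (the submit file name and job ID are hypothetical):

  $ condor_submit myjob.sub     # submit a job (queued by the schedd)
  $ condor_q                    # check on the state of your jobs
  $ condor_status               # check on the state of available computers
  $ condor_rm 1377.0            # cancel job 1377.0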

OSG Summer School 2011 But not only Condor You can use other systems:  PBS/Torque  Oracle Grid Engine (né Sun Grid Engine)  LSF ... But we won’t cover them.  Our expertise is with Condor  Our bias is with Condor What should you learn?  How do you think about HTC?  How can you do your science with HTC?  … For now, learn it with Condor, but you can apply it to other systems. 19

OSG Summer School 2011 A brief introduction to Condor Please note, we will only scratch the surface of Condor:  We won’t cover MPI, Master-Worker, advanced policies, site administration, security mechanisms, Condor-C, submission to other batch systems, virtual machines, cron, high-availability, computing on demand, … Why?  The goal is to introduce you to HTC by using Condor, not to make you Condor experts. 20

OSG Summer School 2011 Condor takes computers… (dedicated clusters, desktop computers) …and jobs (“I need a Mac!”, “I need a Linux box with 2GB RAM!”) …and matches them, via the Matchmaker.

OSG Summer School 2011 Quick Terminology Cluster: A dedicated set of computers not for interactive use Pool: A collection of computers used by Condor  May be dedicated  May be interactive

OSG Summer School 2011 Matchmaking Matchmaking is fundamental to Condor Matchmaking is two-way  Job describes what it requires: I need Linux && 8 GB of RAM  Machine describes what it requires: I will only run jobs from the Physics department Matchmaking allows preferences  I need Linux, and I prefer machines with more memory but will run on any machine you provide me
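
As a sketch, the job side expresses requirements and preferences in its submit file, while the machine side uses a START expression in its configuration; Department below is a hypothetical user-defined attribute, not a built-in one:

  # Job side (submit file): require Linux and at least 8 GB, prefer more memory
  requirements = (OpSys == "LINUX") && (Memory >= 8192)
  rank         = Memory
  +Department  = "Physics"

  # Machine side (condor_config): only run jobs that declare the Physics department
  START = (TARGET.Department =?= "Physics")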

OSG Summer School 2011 Why Two-way Matching? Condor conceptually divides people into three groups:  Job submitters  Machine owners  Pool (cluster) administrator All three of these groups have preferences (and they may or may not be the same people)

OSG Summer School 2011 ClassAds ClassAds state facts  My job’s executable is analysis.exe  My machine’s load average is 5.6 ClassAds state preferences  I require a computer with Linux

OSG Summer School 2011 ClassAds ClassAds are: semi-structured, user-extensible, schema-free; each entry has the form Attribute = Expression. Example:
 MyType = "Job"
 TargetType = "Machine"
 ClusterId = 1377
 Owner = "roy"
 Cmd = "analysis.exe"
 Requirements = (Arch == "INTEL") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)
 …
 (Attribute values can be strings, numbers, or booleans.)

OSG Summer School 2011 Schema-free ClassAds Condor imposes some schema  Owner is a string, ClusterID is a number… But users can extend it however they like, for jobs or machines  AnalysisJobType = "simulation"  HasJava_1_4 = TRUE  ShoeLength = 7 Matchmaking can use these attributes  Requirements = OpSys == "LINUX" && HasJava_1_4 == TRUE
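
A sketch of how a user might add such attributes from the submit file, using the same made-up attribute names as above:

  # Submit file: a leading '+' adds an attribute to the job ClassAd
  +AnalysisJobType = "simulation"
  +ShoeLength      = 7
  requirements     = (OpSys == "LINUX") && (HasJava_1_4 == TRUE)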

OSG Summer School 2011 Submitting jobs: condor_schedd Users submit jobs from a computer  Jobs described as ClassAds  Each submission computer has a queue  Queues are not centralized  Submission computer watches over queue  Can have multiple submission computers  Submission handled by condor_schedd (diagram: the condor_schedd and its queue)

OSG Summer School 2011 Advertising computers Machine owners describe computers  Configuration file extends the ClassAd  ClassAd has dynamic features  Load Average  Free Memory  …  ClassAds are sent to the Matchmaker (diagram: a ClassAd with Type = "Machine" and Requirements = "…" flowing to the Matchmaker’s Collector)
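
On the machine side, an administrator can publish extra attributes from the configuration file; a minimal sketch, again using the made-up attribute names from the previous slide:

  # condor_config on the execute machine: define and advertise extra attributes
  HasJava_1_4  = True
  ShoeLength   = 7
  STARTD_ATTRS = $(STARTD_ATTRS), HasJava_1_4, ShoeLength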

OSG Summer School 2011 Matchmaking Negotiator collects list of computers Negotiator contacts each schedd  What jobs do you have to run? Negotiator compares each job to each computer  Evaluate requirements of job & machine  Evaluate in context of both ClassAds  If both evaluate to true, there is a match Upon match, schedd contacts execution computer

OSG Summer School 2011 Matchmaking diagram: the condor_schedd and its queue provide the job queue service; the Matchmaker’s Collector provides the information service and its Negotiator provides the matchmaking service (steps 1-3 in the figure).

OSG Summer School 2011 Running a Job (diagram): condor_submit hands the job to the condor_schedd and its queue; the Matchmaker (condor_collector and condor_negotiator) finds a machine; a condor_shadow on the submit side manages the local job; on the execute side the condor_startd manages the machine and a condor_starter manages the remote job.

OSG Summer School 2011 Condor processes (Process: Function)
 Master: takes care of other processes
 Collector: stores ClassAds
 Negotiator: performs matchmaking
 Schedd: manages the job queue
 Shadow: manages the job (submit side)
 Startd: manages the computer
 Starter: manages the job (execution side)

OSG Summer School 2011 If you forget most of these, remember two (for other lectures):
 Schedd: manages the job queue
 Startd: manages the computer

OSG Summer School 2011 Some notes One negotiator/collector per pool Can have many schedds (submitters) Can have many startds (computers) A machine can have any combination of:  Just a startd (typical for a dedicated cluster)  schedd + startd (perhaps a desktop)  Personal Condor: everything
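
These roles are usually selected with DAEMON_LIST in the configuration; a rough sketch of the typical combinations:

  # Dedicated execute node
  DAEMON_LIST = MASTER, STARTD
  # Desktop that both submits and runs jobs
  DAEMON_LIST = MASTER, SCHEDD, STARTD
  # Central manager (one per pool)
  DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
  # Personal Condor: everything on one machine
  DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD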

OSG Summer School 2011 Example Pool 1 (diagram): a Matchmaker, a single Schedd (one job queue), and a dedicated cluster in the machine room.

OSG Summer School 2011 Example Pool 1a (diagram): the same as Pool 1, plus fail-over Matchmakers and a fail-over Schedd.

OSG Summer School 2011 Example Pool 2 (diagram): a Matchmaker, a dedicated cluster (machine room), and desktop machines acting as both schedds and execute nodes (many job queues). Note: Condor’s policy capabilities let you choose when desktops act as execute nodes.
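
A hedged sketch of the kind of policy configuration that makes a desktop run jobs only when its owner is away; the idle-time and load thresholds are arbitrary example values:

  # Start jobs only after 15 minutes of keyboard idleness on a lightly loaded machine
  START   = (KeyboardIdle > 15 * 60) && (LoadAvg < 0.3)
  # Suspend running jobs as soon as the owner comes back
  SUSPEND = (KeyboardIdle < 60)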

OSG Summer School 2011 Our Condor Pools One submit computer  One schedd/queue for everyone  vdt-itb.cs.wisc.edu One local set of dedicated Condor execute nodes:  About 8 computers, about 45 available “batch slots” Remote resources at UNL  For Tuesday, not today.

OSG Summer School 2011 Our Condor Pool (condor_status output: columns Name, OpSys, Arch, State, Activity, LoadAv, Mem, ActvtyTime; every slot is LINUX/INTEL and Unclaimed/Idle, followed by the Total / Owner / Claimed / Unclaimed / Matched / Preempting / Backfill summary for INTEL/LINUX)

OSG Summer School 2011 That was a whirlwind tour! Let’s get some hands-on experience with Condor, to solidify this knowledge. 41 Goal: Check out our installation, run some basic jobs.
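
As a preview, a minimal end-to-end run looks roughly like this (file names are made up; the exercises give the exact recipe for our pool):

  $ cat hello.sub
  universe   = vanilla
  executable = /bin/hostname
  output     = hello.out
  error      = hello.err
  log        = hello.log
  queue

  $ condor_submit hello.sub
  $ condor_q              # watch the job go from idle to running to done
  $ cat hello.out         # prints the name of the machine that ran the job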

OSG Summer School 2011 Questions? Questions? Comments?  Feel free to ask me questions later: Alain Roy Upcoming sessions  9:45-10:30  Hands-on exercises  10:30 – 10:45  Break  10:45 – 12:15  More! 42