Farming with Condor Douglas Thain INFN Bologna, December 2001

Outline Introduction –What is Condor? Why Condor on the Farm? Components –Daemons, pools, flocks, ClassAds Short Example –Executing 1000 jobs. Complications –Firewalls, security, etc…

The Condor Project (Est. 1985) Distributed systems CS research performed by a team that faces –software engineering challenges in a UNIX/Linux/NT environment, –active interaction with users and collaborators, –daily maintenance and support challenges of a distributed production environment, –and educating and training students. Funding: NSF, NASA, DoE, DoD, IBM, INTEL, Microsoft, and the UW Graduate School.

A Bird of Opportunity [Diagram: idle desktop machines report "I am idle" to the Central Manager, while a busy submitter reports "I have work"; the Central Manager pairs the job with an idle machine.] Over the course of a week, 80% of a desktop machine's time is wasted.

The Condor Principle: The owner is absolutely in charge! The Condor Corollary: The visitor must be prepared for the unexpected!

Tricky Details What if the user returns? –Checkpoint the job periodically. –Restart the job elsewhere from a checkpoint. What if the machine does not have your files? –Perform I/O via Remote System Calls. These two features require that you link your program with the Condor C library. Can't relink? You may still use Condor, but with some loss of opportunities.
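
Relinking for checkpointing and remote system calls is done by prefixing the ordinary link command with condor_compile; a minimal sketch, assuming a C program sim.c (the source file name is illustrative):

   % condor_compile gcc -o sim sim.c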

Checkpointing [Diagram: a running job periodically writes a checkpoint; after an interruption, the job restarts from that checkpoint on another machine.]

Remote System Calls [Diagram: the job's system calls are forwarded to a shadow process on the submit machine, which performs the I/O against the local disk. Just like home!]

The INFN Condor Pool

226 Condor Pools, 5576 Condor Hosts. [Table: the top 10 Condor pools.]

Back to the Farm The cluster is the new engine of scientific computing. Inexpensive to: –procure –expand –repair

The Ideal Cluster The ideal cluster has every node identical, in every way: –CPU –Memory –File system –User accounts –Software installation Users expect to be able to execute on any node. Some models (MPI) require perfectly matched nodes.

The Bad News Keeping the entire cluster available for use is very difficult when users expect complete symmetry! Software failures: –Full disk, wild process, etc... Hardware failures: –Replace with an exact match? (not the best buy) –Replace with better hardware? (goes unused) Much better to query the state of the cluster than to assume it.

High Throughput Computing is a 24-7-365 activity. FLOPY ≠ (60*60*24*7*52)*FLOPS
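
Working the constant out makes the point concrete: 60*60*24*7*52 = 31,449,600 is the number of seconds in a 52-week year. The inequality says that the operations actually completed over a year (FLOPY) fall short of peak speed (FLOPS) multiplied by every second of that year, because machines crash, sit idle, or are otherwise unavailable.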

Why Condor on the Farm? Condor is expert at managing very heterogeneous resources for high-throughput computing. Large clusters, despite our best efforts, will always be slightly heterogeneous. –(It may not be in your financial interests to keep them perfectly homogeneous.) Condor assists users in making progress, despite the imperfections of the cluster. –Few users *require* the whole identical cluster. –The pursuit of cluster perfection then becomes an issue of small throughput improvement, rather than 0 or max.

Basic HTC Mechanisms Matchmaking - enables requests for services and offers to provide services to find each other (ClassAds). Persistence - records are kept in stable storage; any component may crash and reboot. Asynchronous API - enables management of dynamic (opportunistic) resources. Checkpointing - enables preemptive-resume scheduling (go ahead and use it as long as it is available!). Remote I/O - enables remote (from the execution site) access to local (at the submission site) data.

City Bird, Country Farm The lessons learned and techniques used in stealing cycles from workstations are just as important when trying to maximize the throughput of a homogeneous cluster.

Outline Introduction –What is Condor? Why Condor on the Farm? Components –Daemons, pools, flocks, ClassAds Short Example –Executing 1000 jobs. Complications –Firewalls, security, etc…

Components Condor can be quite complicated: –Many daemons, many connections, many logs... The complexity is necessary and desirable: –Each process represents an independent interest: Machine requirements (startd) User requirements (schedd) System requirements (central manager) Explain the structure by working from the bottom up.

A Single Machine [Diagram: on each machine, a condor_master starts and supervises a condor_startd, and alerts the administrator ("Something is wrong!") if a daemon fails. The startd monitors the CPU, RAM, disk, and keyboard (Size? Speed? Load? User present?) and reports the machine's state and policy to the Central Manager. A local policy file expresses the owner's wishes: "Only run jobs submitted from Bologna or Milan. Prefer jobs owned by thain. Evict jobs that don't fit in memory."]
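
A hedged sketch of how that quoted policy might appear in this machine's condor_config file (START, RANK, and PREEMPT are real startd policy expressions, and Owner, ImageSize, and Memory are real ClassAd attributes; SubmitSite is a hypothetical attribute invented here for illustration):

   # Only run jobs submitted from Bologna or Milan (SubmitSite is hypothetical)
   START = (TARGET.SubmitSite == "bologna") || (TARGET.SubmitSite == "milan")
   # Prefer jobs owned by thain
   RANK = (TARGET.Owner == "thain")
   # Evict jobs that don't fit in memory (ImageSize is in KB, Memory in MB)
   PREEMPT = (TARGET.ImageSize > MY.Memory * 1024)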

A Single Pool [Diagram: six machines, each running a condor_startd over its own cpu, RAM, and disk, report machine state and policy to the Central Manager. Each node has its own local policy ("I prefer thain", "I don't care.", "I prefer mazzanti"), while the Central Manager applies a global policy: "All things being equal, Bologna gets 2x as many machines as Milan."]

A Typical Pool [Diagram: the same pool, but the nodes have no local disks; all file service comes from a shared NFS/AFS server, and the local policy is uniform: "All machines except #3 prefer mazzanti." The global policy is unchanged: "All things being equal, Bologna gets 2x as many machines as Milan."]

Schedulers [Diagram: the same pool, plus two machines running a condor_schedd with queued jobs. The schedds tell the Central Manager "I have work," the unclaimed startds say "I am idle," and the Central Manager introduces them.]

Multiple Pools [Diagram: two pools, one headed by the INFN Central Manager and one by the UWCS Central Manager, each containing six condor_startd machines. A condor_schedd with queued jobs reports to both central managers, so its jobs may run in either pool.]

Matchmaking Each Central Manager is an introduction service that matches compatible machines and jobs. A simple language (ClassAds) is used to represent everyone's needs and desires. The match is not a binding contract -- each side is responsible for enforcing its own needs. If a central manager crashes, jobs will continue to run, but no further introductions are made.

ClassAd Example

Job Ad:
   Type = "Job"
   Cmd = "cmsim.exe"
   Owner = "thain"
   Requirements = (OpSys=="LINUX") && (Memory>128)

Machine Ad:
   Type = "Machine"
   Name = "vulture"
   OpSys = "LINUX"
   Memory = 256
   Requirements = (Owner=="thain")
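
A worked check of why this pair matches, using only the ads above: the job's Requirements, (OpSys=="LINUX") && (Memory>128), evaluate against the machine ad to ("LINUX"=="LINUX") && (256>128), which is true; the machine's Requirements, (Owner=="thain"), evaluate against the job ad to ("thain"=="thain"), also true. Since each side's Requirements are satisfied by the other's ad, the Central Manager can propose the match.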

Matchmaking with ClassAds [Diagram: the schedd sends a Job Ad ("I have work") and the startd a Machine Ad ("I am idle") to the Central Manager. The Central Manager finds the match and sends a match notification to both sides. The schedd then contacts the startd directly to claim it and execute a job -- and execute again, and again!]

Placement vs. Scheduling A Condor Central Manager suggests the placement of jobs on machines, with the understanding that all jobs are ready to run. A Condor scheduler is responsible for executing a list of jobs with various requirements. It may order jobs according to the users' requests. Neither component plans ahead to make a schedule or a reservation for execution -- it is assumed change is so frequent that schedules are not useful.

Can we Schedule? Of course, scheduling is important for users that have strict time constraints. Scheduling is more important to High-Performance Computing (HPC) than High-Throughput Computing (HTC). Scheduling requirements may be worked into Condor in one of two ways: –1 - Users may share a single submission point. –2 - The administrator may periodically reconfigure policy according to a schedule established elsewhere.

Scheduling [Diagram: Method 1: all users share a single condor_schedd, which submits their jobs in the desired order. Method 2: the administrator modifies the global policy when necessary, e.g. 8:00: all nodes prefer thain; 10:00: all nodes prefer mazzanti.]

Outline Introduction –What is Condor? Why Condor on the Farm? Components –Daemons, pools, flocks, ClassAds Short Example –Executing 1000 jobs. Complications –Firewalls, security, etc…

How Many Machines?

   % condor_status
   Name           OpSys         Arch    State      Activity  LoadAv  Mem
   lxpc1.na.infn  LINUX-GLIBC   INTEL   Unclaimed  Idle
   axpd21.pd.inf  OSF1          ALPHA   Owner      Idle
   vlsi11.pd.inf  SOLARIS26     SUN4u   Claimed    Busy
   ...
                       Machines  Owner  Claimed  Unclaimed  Matched  Preempting
   ALPHA/OSF
   INTEL/LINUX
   INTEL/LINUX-GLIBC
   SUN4u/SOLARIS
   SUN4u/SOLARIS
   SUN4u/SOLARIS
   SUN4x/SOLARIS
   Total
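
To list only the machines that could serve a particular job, condor_status can filter on a ClassAd expression; a usage sketch (-constraint is a real condor_status option; the expression here is illustrative):

   % condor_status -constraint '(OpSys == "LINUX-GLIBC") && (Memory > 128)'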

Submit the Job Create a submit file with vi sim.submit:

   Executable = sim
   Input = sim.in
   Output = sim.out
   Log = sim.log
   queue

Then submit the job: condor_submit sim.submit
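
A hedged sketch of a slightly fuller submit file (Universe, Error, and Requirements are standard submit-description commands; the values are illustrative). Requesting the standard universe provides the checkpointing and remote system calls described earlier, assuming the executable was relinked with condor_compile:

   # Illustrative submit file; the values are assumptions
   Universe     = standard
   Executable   = sim
   Input        = sim.in
   Output       = sim.out
   Error        = sim.err
   Log          = sim.log
   Requirements = (OpSys == "LINUX") && (Memory > 128)
   queue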

Watch the Progress

   % condor_q
   -- Submitter: axpbo8.bo.infn.it : :
    ID    OWNER   SUBMITTED    RUN_TIME  ST  PRI  SIZE  CMD
    5.0   thain   6/21 12:      :00:15   R              sim.exe

Each job gets a unique number (ID). ST gives the status: Unexpanded, Running, or Idle. SIZE is the size of the program image (MB).
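
If a job stays idle longer than expected, condor_q can report why it has not been matched; a usage sketch (-analyze is a real condor_q option, though its output varies across versions):

   % condor_q -analyze 5.0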

Receive Email When Done

   This is an automated email from the Condor system on machine "axpbo8.bo.infn.it". Do not reply.

   Your condor job /tmp_mnt/usr/users/ccl/thain/test/sim 40 exited with status 0.

   Submitted at:    Wed Jun 21 14:24:
   Completed at:    Wed Jun 21 14:36:
   Real Time:       0 00:11:54
   Run Time:        0 00:06:52
   Committed Time:  0 00:01:37
   ...

Running Many Processes The real benefit of Condor comes from managing thousands of jobs. First, get organized: write a script to make 1000 input files (a sketch follows below). Now, simply adjust your submit file:

   Executable = sim.exe
   Input = sim.in.$(PROCESS)
   Output = sim.out.$(PROCESS)
   Log = sim.log
   Queue 1000
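
A minimal sketch of such an input-generating script, assuming each run differs only in a seed written to its input file (the file contents are purely illustrative):

   #!/bin/sh
   # Create sim.in.0 through sim.in.999, one input file per job.
   i=0
   while [ $i -lt 1000 ]; do
       echo "seed $i" > sim.in.$i    # replace with your real parameters
       i=`expr $i + 1`
   done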

What can go wrong? If an execution site crashes: –Your job will restart elsewhere. If the central manager crashes: –Jobs will continue to run, no new matches will be made. If the submit machine crashes: –Jobs will stop, but be re-started when it reboots. The only way to lose a job is to throw away the disk on the submit machine!

Outline Introduction –What is Condor? Why Condor on the Farm? Components –Daemons, pools, flocks, ClassAds Short Example –Executing 1000 jobs. Complications –Firewalls, security, etc…

Firewalls Why a firewall? –Prevent all outside contact. –Prevent non-approved contact. –Carefully securing every node is too much work. What’s the problem? –A variety of processes comprise Condor. –A variety of ports must be used at once. –Submit and execute machines must communicate directly, not through the CM.

The Firewall Problem [Diagram: the Central Manager and five condor_startd machines sit on a private network behind a firewall; a condor_schedd and one more startd are on the public network. The firewall blocks the direct connections that Condor daemons on opposite sides need to make.]

Firewall Solution #1 [Diagram: the same layout, but the firewall is configured to allow a fixed range of ports, and the Condor daemons behind it are configured to use only that same range, so their connections pass through.]

Firewall Solution #1 Pros: –Easy to configure Condor. –Easy to configure the firewall. –Machines remain a part of the pool. Cons: –The number of ports limits the number of simultaneous interactions with the node. (running jobs + queue ops + negotiations, etc.) –More ports = more connections, less security.

Firewall Solution #2 [Diagram: the entire pool, including the Central Manager and the condor_schedd, sits inside the private network; users reach the submit machine from the public network via ssh through the firewall.]

Firewall Solution #2 Pros: –Only port through router is ssh. Cons: –Pool is partitioned! –Users must manually submit to every pool that is behind a firewall. (I.e. they won’t.) –No global policy possible. –No global management/status possible.

Network Address Translation Both solutions only work as long as the firewall simply drops packets it doesn't like. If the firewall is a Network Address Translator (masquerading), then only solution #2 works. Research in progress: a Condor NAT that runs on the firewall and exports the pool to the outside world.

Security Current Condor security: –Authenticate via DNS. –Authorize classes of hosts for certain tasks. New Condor (6.3.X?) security: –Authenticate with encrypted credentials. –Authorize on a per-user basis. –Forward credentials to necessary sites.

Condor 6.2 Security Authentication: DNS is queried for each incoming connection in order to determine the host name. Authorization: each participant permits a class of hosts to perform certain tasks. At UW-CS:

   HOSTALLOW_READ = *.wisc.edu, *.infn.it
     Hosts that may query the machine state.
   HOSTALLOW_WRITE = *.cs.wisc.edu, *.infn.it
     Hosts that may execute jobs, send updates, etc...
   HOSTALLOW_OWNER = $(FULL_HOSTNAME)
     Hosts that may cause this machine to vacate.
   HOSTALLOW_ADMINISTRATOR = condor.cs.wisc.edu
     Hosts that may change priorities, turn Condor on/off.

Condor 6.3.X? Security Principle: No single security mechanism is appropriate for all sites. Condor must have many tools. –United States Air Force: Kerberos authentication, all connections encrypted. –Cluster behind a firewall: Host authentication, no encryption. –Grid Computing: GSI credentials from certain authorities; encryption is up to the user.

Condor 6.3.X Security [Diagram: a submit machine (condor_schedd) and an execute machine (condor_startd) negotiate an authentication method with each other and the Central Manager: "KRB5? NO. GSI? YES!" The submitter forwards its GSI certificate to the execute machine, which uses it to authenticate the job's I/O against remote data storage.]

You don’t have to be a super person to do super computing!

Getting Condor Condor Home Page – Binaries are freely available. Versions: –6.2.x - Stable releases, bug fixes only –6.3.x - Development releases

For More Info Condor Home Page – These slides: – Douglas Thain Questions Now?