Douglas Thain INFN Bologna, December 2001


1 Farming with Condor
Douglas Thain, thain@cs.wisc.edu
INFN Bologna, December 2001

2 Outline
Introduction: What is Condor? Why Condor on the Farm?
Components: Daemons, pools, flocks, ClassAds
Short Example: Executing 1000 jobs.
Complications: Firewalls, security, etc.

3 The Condor Project (Est. 1985)
Distributed systems CS research performed by a team that faces:
  software engineering challenges in a UNIX/Linux/NT environment,
  active interaction with users and collaborators,
  daily maintenance and support challenges of a distributed production environment,
  and educating and training students.
Funding: NSF, NASA, DoE, DoD, IBM, INTEL, Microsoft and the UW Graduate School

4 A Bird of Opportunity
[Figure: machines labeled Busy and Idle around a Central Manager. Idle machines say “I am idle.”, users say “I have work.”, and jobs are placed on idle machines.]
Over the course of a week, 80% of a desktop machine’s time is wasted.

5 The Condor Principle
The Condor Principle: The owner is absolutely in charge!
The Condor Corollary: The visitor must be prepared for the unexpected!

6 Tricky Details
What if the user returns?
  Checkpoint the job periodically.
  Restart the job elsewhere from a checkpoint.
What if the machine does not have your files?
  Perform I/O via Remote System Calls.
These two features require that you link with the Condor C library.
Can’t relink? You may still use Condor, but with some loss in opportunities.

7 Checkpointing
[Figure: a job writes a checkpoint; the job is later restarted from that checkpoint.]
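The checkpoint-and-restart idea can be illustrated with a small application-level sketch in Python. This is not how Condor does it: Condor checkpoints the whole process transparently through its C library, whereas the sketch below saves state explicitly. The names `run_job`, `save_checkpoint`, and the `stop_after` parameter (which simulates the owner returning and evicting the job) are hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write the state to a temporary file, then rename it into place,
    so a crash mid-write never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def run_job(total_steps, checkpoint_path, checkpoint_every=100, stop_after=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    If a checkpoint exists, resume from it instead of starting over.
    stop_after simulates eviction: the job gives up (returns None)
    after that many steps in the current run.
    """
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "total": 0}

    steps_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            return None  # evicted; work since the last checkpoint is lost
        state["total"] += state["step"]
        state["step"] += 1
        steps_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, checkpoint_path)
    return state["total"]
```

An evicted run loses only the work done since its last checkpoint; the next run, possibly on another machine that can read the checkpoint file, redoes that part and continues.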

8 Remote System Calls
Just like home!
[Figure: the job’s remote system calls travel to the shadow, which performs the I/O on the home disk.]

9 The INFN Condor Pool

10 Top 10 Condor Pools: 226 Condor Pools 5576 Condor Hosts

11 Back to the Farm
The cluster is the new engine of scientific computing.
Inexpensive to: procure, expand, repair.

12 The Ideal Cluster
The ideal cluster has every node identical, in every way: CPU, memory, file system, user accounts, software installation.
Users expect to be able to execute on any node.
Some models (MPI) require perfectly matched nodes.

13 The Bad News
Keeping the entire cluster available for use is very difficult when users expect complete symmetry!
Software failures: Full disk, wild process, etc...
Hardware failures:
  Replace with exact match? (not best buy)
  Replace with better hardware? (goes unused)
Much better to query rather than assume state of the cluster.

14 High Throughput Computing is a 24-7-365 activity.
FLOPY ≠ (60*60*24*7*52)*FLOPS
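A rough arithmetic sketch of this slide, read as saying that the work delivered in a year is not simply the peak rate multiplied by the seconds in a year. The node speed and the `useful_fraction` value below are hypothetical numbers chosen only for illustration; they do not come from the talk.

```python
# Seconds in a 52-week year, exactly as written on the slide.
SECONDS_PER_YEAR = 60 * 60 * 24 * 7 * 52  # 31,449,600


def peak_flopy(flops):
    """Upper bound: the machine computes at its peak rate all year."""
    return SECONDS_PER_YEAR * flops


def delivered_flopy(flops, useful_fraction):
    """Yearly work delivered when only part of the year is productive
    (the rest is lost to downtime, idle time, failed jobs, and so on)."""
    if not 0.0 <= useful_fraction <= 1.0:
        raise ValueError("useful_fraction must be between 0 and 1")
    return peak_flopy(flops) * useful_fraction


if __name__ == "__main__":
    flops = 1e9            # hypothetical 1 GFLOPS node
    useful_fraction = 0.6  # hypothetical: 60% of the year is useful work
    print("peak FLOPY:     ", peak_flopy(flops))
    print("delivered FLOPY:", delivered_flopy(flops, useful_fraction))
```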

15 Why Condor on the Farm?
Condor is expert at managing very heterogeneous resources for high-throughput computing.
Large clusters, despite our best efforts, will always be slightly heterogeneous. (It may not be in your financial interests to keep them perfectly homogeneous.)
Condor assists users in making progress, despite the imperfections of the cluster.
Few users *require* the whole identical cluster.
The pursuit of cluster perfection is then an issue of small throughput improvement, rather than 0 or max.

16 Basic HTC Mechanisms
Matchmaking - enables requests for services and offers to provide services to find each other (ClassAds).
Persistence - records are kept in stable storage -- any component may crash and reboot.
Asynchronous API - enables management of dynamic (opportunistic) resources.
Checkpointing - enables preemptive resume scheduling (go ahead and use it as long as it is available!).
Remote I/O - enables remote (from execution site) access to local (at submission site) data.

17 City Bird, Country Farm
The lessons learned and techniques used in stealing cycles from workstations are just as important when trying to maximize the throughput of a homogeneous cluster.

18 Outline
Introduction: What is Condor? Why Condor on the Farm?
Components: Daemons, pools, flocks, ClassAds
Short Example: Executing 1000 jobs.
Complications: Firewalls, security, etc.

19 Components
Condor can be quite complicated: many daemons, many connections, many logs...
The complexity is necessary and desirable. Each process represents an independent interest:
  Machine requirements (startd)
  User requirements (schedd)
  System requirements (central manager)
Explain the structure by working from the bottom up.

20 A Single Machine
[Figure: the condor master reports “Something is wrong!” to the administrator. The condor startd inspects the cpu (Speed? Load?), RAM (Size?), disk (Avail? Size?), and keyboard (User present?), reads the local policy file, and sends machine state and policy to the Central Manager.]
Local policy file: “Only run jobs submitted from Bologna or Milan. Prefer jobs owned by thain. Evict jobs that don’t fit in memory.”

21 A Single Pool
[Figure: six machines, each running a condor startd with its own cpu, RAM, and disk, send machine state and policy to the Central Manager.]
Global Policy: “All things being equal, Bologna gets 2x as many machines as Milan.”
Local Policy: “I prefer thain”
Local Policy: “I don’t care.”
Local Policy: “I prefer mazzanti”

22 A Typical Pool
[Figure: six machines, each running a condor startd with cpu and RAM, a Central Manager, and an NFS / AFS server with disk, cpu, and RAM.]
Global Policy: “All things being equal, Bologna gets 2x as many machines as Milan.”
Uniform Local Policy: “All machines except #3 prefer mazzanti”

23 Schedulers
[Figure: condor schedds holding jobs tell the Central Manager “I have work.”; condor startds (each with RAM and cpu) tell it “I am idle.”]

24 Multiple Pools
[Figure: condor schedds holding jobs, the INFN Central Manager, the UWCS Central Manager, and condor startds (each with RAM and cpu) running jobs.]

25 Matchmaking
Each Central Manager is an introduction service that matches compatible machines and jobs.
A simple language (ClassAds) is used to represent everyone’s needs and desires.
The match is not a binding contract -- each side is responsible for enforcing its own needs.
If a central manager crashes, jobs will continue to run, but no further introductions are made.

26 ClassAd Example
Job Ad:
  Type = "Job"
  Cmd = "cmsim.exe"
  Owner = "thain"
  Requirements = (OpSys == "LINUX") && (Memory > 128)
Machine Ad:
  Type = "Machine"
  Name = "vulture"
  OpSys = "LINUX"
  Memory = 256
  Requirements = (Owner == "thain")
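The two-sided nature of this match can be sketched in Python. This is only an illustration under loud assumptions: real ClassAds are an expression language evaluated by Condor, not Python dictionaries with lambdas, and the `matches` helper is hypothetical. The attribute values are the ones from the slide.

```python
# Each ad is a dict of attributes plus a "Requirements" predicate that
# inspects the OTHER party's ad. Missing attributes simply fail the test.

job_ad = {
    "Type": "Job",
    "Cmd": "cmsim.exe",
    "Owner": "thain",
    # The job wants a Linux machine with more than 128 (MB of) memory.
    "Requirements": lambda other: (
        other.get("OpSys") == "LINUX" and other.get("Memory", 0) > 128
    ),
}

machine_ad = {
    "Type": "Machine",
    "Name": "vulture",
    "OpSys": "LINUX",
    "Memory": 256,
    # The machine only accepts jobs owned by thain.
    "Requirements": lambda other: other.get("Owner") == "thain",
}


def matches(a, b):
    """Two ads match only if each one's Requirements accepts the other."""
    return bool(a["Requirements"](b)) and bool(b["Requirements"](a))


if __name__ == "__main__":
    print(matches(job_ad, machine_ad))
```

Because both predicates must hold, a machine can refuse a job that would otherwise be happy to run there, which is how the owner stays in charge.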

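27 Matchmaking with ClassAds
[Figure: the schedd sends a Job Ad (“I have work.”) and the startd sends a Machine Ad (“I am idle.”) to the Central Manager. The Central Manager finds a match and sends a match notification. The schedd then claims the startd and executes the job. “Execute again. …and again!”]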

28 Placement vs. Scheduling
A Condor Central Manager suggests the placement of jobs on machines, with the understanding that all jobs are ready to run.
A Condor scheduler is responsible for executing a list of jobs with various requirements. It may order jobs according to the user’s requests.
Neither component plans ahead to make a schedule or a reservation for execution -- it is assumed change is so frequent that schedules are not useful.

29 Can we Schedule?
Of course, scheduling is important for users that have strict time constraints.
Scheduling is more important to High-Performance Computing (HPC) than High-Throughput Computing (HTC).
Scheduling requirements may be worked into Condor in one of two ways:
  1 - Users may share a single submission point.
  2 - The administrator may periodically reconfigure policy according to a schedule established elsewhere.

30 Scheduling
[Figure: a condor schedd holding jobs, a Central Manager, and condor startds (each with RAM and cpu) reporting “I am idle.”]
Method 1: All users share a schedd.
Method 2: Modify global policy when necessary.
  8:00: All nodes prefer thain.
  10:00: All nodes prefer mazzanti.

31 Outline
Introduction: What is Condor? Why Condor on the Farm?
Components: Daemons, pools, flocks, ClassAds
Short Example: Executing 1000 jobs.
Complications: Firewalls, security, etc.

32 How Many Machines?
% condor_status
Name           OpSys        Arch   State      Activity  LoadAv  Mem
lxpc1.na.infn  LINUX-GLIBC  INTEL  Unclaimed  Idle
axpd21.pd.inf  OSF          ALPHA  Owner      Idle
vlsi11.pd.inf  SOLARIS26    SUN4u  Claimed    Busy
. . .
                   Machines  Owner  Claimed  Unclaimed  Matched  Preempting
ALPHA/OSF
INTEL/LINUX
INTEL/LINUX-GLIBC
SUN4u/SOLARIS
SUN4u/SOLARIS
SUN4u/SOLARIS
SUN4x/SOLARIS
Total

33 Submit the Job
Create a submit file:
  vi sim.submit
    Executable = sim
    Input = sim.in
    Output = sim.out
    Log = sim.log
    queue
Submit the job:
  condor_submit sim.submit

34 Watch the Progress
% condor_q
-- Submitter: axpbo8.bo.infn.it : < :1038> :
ID   OWNER  SUBMITTED  RUN_TIME  ST  PRI  SIZE  CMD
5.0  thain  /21 12:    :00:15    R              sim.exe
ID: Each job gets a unique number.
ST: Status: Unexpanded, Running or Idle
SIZE: Size of program image (MB)

35 Receive E-mail When Done
This is an automated email from the Condor system on machine "axpbo8.bo.infn.it". Do not reply.
Your condor job /tmp_mnt/usr/users/ccl/thain/test/sim 40 exited with status 0.
Submitted at: Wed Jun 21 14:24:
Completed at: Wed Jun 21 14:36:
Real Time: :11:54
Run Time: :06:52
Committed Time: :01:37
. . .

36 Running Many Processes
The real benefit of Condor comes from managing 1000s of jobs.
First, get organized. Write a script to make 1000 input files.
Now, simply adjust your submit file:
    Executable = sim.exe
    Input = sim.in.$(PROCESS)
    Output = sim.out.$(PROCESS)
    Log = sim.log
    Queue 1000
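The "write a script to make 1000 input files" step might look like the following Python sketch. The input file contents (a single `seed = N` line) are a hypothetical format invented for illustration, since the talk does not say what `sim.exe` reads; the function names are also assumptions. File names follow the slide's `sim.in.$(PROCESS)` pattern, with process numbers running from 0 to 999.

```python
from pathlib import Path


def make_inputs(directory, count=1000):
    """Write sim.in.0 ... sim.in.<count-1>, one per Condor process number."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for process in range(count):
        path = directory / f"sim.in.{process}"
        # Hypothetical input format: one parameter line per job.
        path.write_text(f"seed = {process}\n")
        paths.append(path)
    return paths


def make_submit_file(directory, count=1000):
    """Write the submit file from the slide, queueing `count` processes."""
    text = (
        "Executable = sim.exe\n"
        "Input = sim.in.$(PROCESS)\n"
        "Output = sim.out.$(PROCESS)\n"
        "Log = sim.log\n"
        f"Queue {count}\n"
    )
    path = Path(directory) / "sim.submit"
    path.write_text(text)
    return path


if __name__ == "__main__":
    make_inputs(".", 1000)
    make_submit_file(".", 1000)
```

After running it in the job directory, `condor_submit sim.submit` queues all the processes at once.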

37 What can go wrong?
If an execution site crashes:
  Your job will restart elsewhere.
If the central manager crashes:
  Jobs will continue to run, no new matches will be made.
If the submit machine crashes:
  Jobs will stop, but be re-started when it reboots.
The only way to lose a job is to throw away the disk on the submit machine!

38 Outline
Introduction: What is Condor? Why Condor on the Farm?
Components: Daemons, pools, flocks, ClassAds
Short Example: Executing 1000 jobs.
Complications: Firewalls, security, etc.

39 Firewalls
Why a firewall?
  Prevent all outside contact.
  Prevent non-approved contact.
  Carefully securing every node is too much work.
What’s the problem?
  A variety of processes comprise Condor.
  A variety of ports must be used at once.
  Submit and execute machines must communicate directly, not through the CM.

40 The Firewall Problem
[Figure: a firewall between the Public Network and the Private Network, with a Central Manager, a condor schedd, and condor startds (each with RAM and cpu).]

41 Firewall Solution #1
Allow ports 1000-1010.
Use only ports 1000-1010.
[Figure: the Public Network and the Private Network, with a Central Manager, a condor schedd, and condor startds (each with RAM and cpu).]

42 Firewall Solution #1
Pros:
  Easy to configure Condor.
  Easy to configure firewall.
  Machines remain a part of the pool.
Cons:
  Number of ports limits number of simultaneous interactions with the node (running jobs + queue ops + negotiations, etc.).
  More ports = more connections, less security.

43 Firewall Solution #2
[Figure: the pool on the Private Network, with its own condor schedd, Central Manager, and condor startds (each with RAM and cpu), sits behind the firewall. The only connection from the Public Network is ssh.]

44 Firewall Solution #2
Pros:
  Only port through router is ssh.
Cons:
  Pool is partitioned!
  Users must manually submit to every pool that is behind a firewall. (I.e. they won’t.)
  No global policy possible.
  No global management/status possible.

45 Network Address Translation
Both solutions only work as long as the firewall simply drops packets it doesn’t like.
If the firewall is a Network Address Translator (masquerade), then only solution #2 works.
Research in Progress: A Condor NAT that runs on the firewall and exports the pool to the outside world.

46 Security
Current Condor security:
  Authenticate via DNS.
  Authorize classes of hosts for certain tasks.
New Condor (6.3.X?) security:
  Authenticate with encrypted credentials.
  Authorize on a per-user basis.
  Forward credentials to necessary sites.

47 Condor 6.2 Security
Authentication: DNS is queried for each incoming connection in order to determine the name.
Authorization: Each participant permits a class of hosts to perform certain tasks.
At UW-CS:
  HOSTALLOW_READ = *.wisc.edu, *.infn.it
    Hosts that may query the machine state.
  HOSTALLOW_WRITE = *.cs.wisc.edu, *.infn.it
    Hosts that may execute jobs, send updates, etc...
  HOSTALLOW_OWNER = $(FULL_HOSTNAME)
    Hosts that may cause this machine to vacate.
  HOSTALLOW_ADMINISTRATOR = condor.cs.wisc.edu
    Hosts that may change priorities, turn Condor on/off
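The host-class authorization described here can be sketched in Python as wildcard matching of a hostname against a per-task allow list. This is an illustration only, not Condor's implementation: the `ALLOW` table layout and the `host_allowed` function are hypothetical, and shell-style matching via `fnmatchcase` stands in for whatever pattern rules Condor applies. The patterns themselves are copied from the slide (the `HOSTALLOW_OWNER` entry is omitted because it depends on the local host name).

```python
from fnmatch import fnmatchcase

# Allow lists taken from the slide's UW-CS example.
ALLOW = {
    "READ": ["*.wisc.edu", "*.infn.it"],
    "WRITE": ["*.cs.wisc.edu", "*.infn.it"],
    "ADMINISTRATOR": ["condor.cs.wisc.edu"],
}


def host_allowed(level, hostname):
    """Return True if hostname matches any pattern listed for this level.

    The hostname is assumed to have been obtained already (the slide says
    by a DNS query on the incoming connection). Host names are compared
    case-insensitively by lowercasing before matching.
    """
    host = hostname.lower()
    return any(fnmatchcase(host, pattern) for pattern in ALLOW.get(level, []))


if __name__ == "__main__":
    print(host_allowed("READ", "foo.physics.wisc.edu"))
    print(host_allowed("WRITE", "foo.physics.wisc.edu"))
    print(host_allowed("WRITE", "axpbo8.bo.infn.it"))
```

Note that the decision rests entirely on the host name, which is why the newer scheme on the following slides moves to per-user credentials.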

48 Condor 6.3.X? Security
Principle: No single security mechanism is appropriate for all sites. Condor must have many tools.
United States Air Force: Kerberos authentication, all connections encrypted.
Cluster behind a firewall: Host authentication, no encryption.
Grid Computing: GSI credentials from certain authorities, encryption is up to the user.

49 Condor 6.3.X Security
[Figure: a Submit machine (condor schedd), an Execute machine (condor startd, RAM, cpu), the Central Manager, and data storage (Disk) used for I/O. Labels on the connections: “GSI?” “YES!”, “GSI KRB 5?”, “GSI?” “NO”, “FORWARD CERT”.]

50 You don’t have to be a super person to do super computing!

51 Getting Condor
Condor Home Page
Binaries are freely available.
Versions:
  6.2.x - Stable releases, bug fixes only
  6.3.x - Development releases

52 For More Info
Condor Home Page
These slides: Douglas Thain
Questions Now?

