Using Condor Glide-ins and GCB to run in a grid environment (Condor Week 2006, University of Wisconsin; Matthew Norman)


1 Using Condor Glide-ins and GCB to run in a grid environment
Condor Week 2006, University of Wisconsin, Matthew Norman
Elliot Lipeles, Matthew Norman, Frank Würthwein (University of California, San Diego)
Subir Sarkar, Igor Sfiligoi (INFN)

2 Traditional CDF use of Condor
CDF Analysis Farms (CAF)...
Single point of submission system:
● CAF daemons accept and authenticate user jobs, handle output.
● Condor schedd does the real job.
Diagram labels: CAF (Portal), Worker Node (WN), Startd, Collector, Schedd, Submitter Daemons, Negotiator, Legacy protocol, Monitoring Daemons.

3 Need to move to the Grid
● HEP moving to the Grid
● Nobody wants to finance dedicated nodes
● Need to move
● Want to preserve user interface
Diagram label: “The Grid”

4 Plain Condor-G?
Works, but...
● No central matchmaking
● No control over priorities
● Site selection problem
● Black holes can eat most of the user jobs
Diagram labels: Submitter Daemons, Schedd, Globus, Grid Pool, Batch queue, Batch Slot, User Job.

5 Condor Glide-Ins
Turn shared Grid resources into temporary private Condor resources:
● Submit as a standard Grid job.
● Advertise themselves as a Condor batch slot.
● Run a user job.
Steps shown in the diagram:
1) Submit glidein to the Grid
2) Glide-in starts running on a batch slot
3) Startd advertises itself to Collector
4) User job sent to startd
Diagram labels: Collector, Schedd, Negotiator, Globus, Grid Pool, Batch queue, Available Batch Slot, Glide-in, Startd.

6 CDF use of Glide-Ins
● Condor allows single point of submission to Grid sites.
● Matchmaking done globally under CDF control.
● Users are presented with a familiar interface.
● Jobs run as Condor jobs with all features (such as Condor on Demand) regardless of underlying site batch system.
● Black holes eat glide-ins, not user jobs.
Diagram annotations:
● Glidekeeper Daemon: checks to see if jobs are queued.
● Glide-in Schedd: if jobs are queued, a glide-in is submitted to a second schedd.
Diagram labels: GlideCAF (Portal), Collector, Main Schedd, Submitter Daemons, Negotiator, Monitoring Daemons, Glide-ins to Grid, Startds registering, User Jobs, steps (1) to (6).
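The glidekeeper behaviour described on this slide (check whether jobs are queued, then submit glide-ins) can be sketched as a small shell function. This is only an illustration under assumed rules: the function name `glideins_to_submit`, the one-glide-in-per-idle-job policy and the cap on pending glide-ins are hypothetical and are not taken from the CDF implementation.

```shell
#!/bin/sh
# Hypothetical glidekeeper policy (not the CDF code):
# request one glide-in per idle user job that is not already covered
# by a pending glide-in, without exceeding a cap on pending glide-ins.
# Arguments: idle user jobs, glide-ins already pending, cap on pending glide-ins.
glideins_to_submit() {
    idle_jobs=$1
    pending_glideins=$2
    max_pending=$3

    needed=$((idle_jobs - pending_glideins))
    room=$((max_pending - pending_glideins))

    # Never ask for more than the cap leaves room for.
    if [ "$needed" -gt "$room" ]; then
        needed=$room
    fi
    # Never report a negative number of submissions.
    if [ "$needed" -lt 0 ]; then
        needed=0
    fi
    echo "$needed"
}
```

In a real daemon the two counts would come from querying the main schedd and the glide-in schedd (for example with condor_q), and the result would decide how many glide-in jobs to queue on the second schedd.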

7 Glide-Ins are just regular Startds
condor_config:
    DAEMON_LIST = MASTER,STARTD
    NEGOTIATOR_HOST = $(HEAD_NODE)
    COLLECTOR_HOST = $(HEAD_NODE)
    MaxJobRetirementTime=$(SHUTDOWN_GRACEFUL_TIMEOUT)
    SHUTDOWN_GRACEFUL_TIMEOUT=288000
    # How long will it wait in an unclaimed state before exiting
    STARTD_NOCLAIM_SHUTDOWN = 1200
    HEAD_NODE = cdfhead.fnal.gov
glidein_startup.sh:
    validate_node()
    local_config()
    ./condor_master -r $retmins -dyn -f
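The slide names a `local_config()` step but does not show its body. One way it could work, as a minimal sketch, is to write the settings shown above into a config file inside the glide-in's working directory. The function's argument and the output file name `condor_config` are assumptions made for this example.

```shell
#!/bin/sh
# Sketch of a local_config step (body assumed, settings copied from the slide).
# Writes the glide-in's Condor configuration into DIR/condor_config.
# Argument: DIR, the directory where the glide-in was unpacked (hypothetical).
local_config() {
    config_dir=$1
    mkdir -p "$config_dir" || return 1
    # The quoted 'EOF' keeps the shell from expanding $(HEAD_NODE):
    # those are Condor macros and must reach the file literally.
    cat > "$config_dir/condor_config" <<'EOF'
DAEMON_LIST = MASTER,STARTD
NEGOTIATOR_HOST = $(HEAD_NODE)
COLLECTOR_HOST = $(HEAD_NODE)
MaxJobRetirementTime=$(SHUTDOWN_GRACEFUL_TIMEOUT)
SHUTDOWN_GRACEFUL_TIMEOUT=288000
# How long will it wait in an unclaimed state before exiting
STARTD_NOCLAIM_SHUTDOWN = 1200
HEAD_NODE = cdfhead.fnal.gov
EOF
}
```

The startup script would then make condor_master read this file, for instance by exporting the CONDOR_CONFIG environment variable before launching it.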

8 Delivering files with Condor-G
glidein.submit:
    Universe = Globus
    GlobusScheduler = mysite/jobmanager-mybatch
    Executable = glidein_startup.sh
    transfer_Input_files = condor.tgz
    Queue
Cannot use the Condor-G transfer mechanism! Only the executable is guaranteed to be transferred on EGEE.

9 Use a Web server for file delivery
glidein_startup.sh:
    validate_node()
    wget http://cdfhead.fnal.gov/condor.tgz
    sha1sum knownSHA1 condor.tgz
    if [ $? -eq 0 ]; then
        tar -xzf condor.tgz
        local_config()
        ./condor_master -r $retmins -dyn -f
    fi
Diagram labels: GlideCAF (Portal), Collector, Main Schedd, Glidekeeper Daemon, Glide-in Schedd, HTTPd, steps (1) to (4).
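The `sha1sum knownSHA1 condor.tgz` line on the slide is schematic: the real `sha1sum` treats its arguments as file names. A minimal sketch of the intended check, assuming the expected digest is embedded in the startup script itself, could look like this. The function name `verify_sha1` is hypothetical.

```shell
#!/bin/sh
# verify_sha1 FILE EXPECTED_HEX
# Succeeds only if the SHA-1 digest of FILE equals EXPECTED_HEX.
# (Helper name and calling convention are assumptions for this sketch.)
verify_sha1() {
    file=$1
    expected=$2
    # sha1sum prints "<digest>  <filename>"; keep only the digest field.
    actual=$(sha1sum "$file" | cut -d ' ' -f 1)
    [ "$actual" = "$expected" ]
}
```

In the startup script the call would sit between the download and the unpacking, for example `verify_sha1 condor.tgz "$known_sha1" && tar -xzf condor.tgz`, where `known_sha1` is a variable assumed to hold the digest shipped with the script.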

10 HTTP Proxy can reduce traffic to head node
glidein_startup.sh:
    validate_node()
    env http_proxy=cdfsquid1.fnal.gov wget http://cdfhead.fnal.gov/condor.tgz
    sha1sum knownSHA1 condor.tgz
    if [ $? -eq 0 ]; then
        tar -xzf condor.tgz
        local_config()
        ./condor_master -r $retmins -dyn -f
    fi
Diagram labels: GlideCAF (Portal), Collector, Main Schedd, Glidekeeper Daemon, Glide-in Schedd, HTTPd, HTTP Proxy (once), steps (1) to (4).
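Since not every site will have a proxy nearby, the `env http_proxy=...` prefix can be made optional. The wrapper below is a sketch of that idea; the name `with_http_proxy` is hypothetical, and it relies only on the fact that wget honours the `http_proxy` environment variable.

```shell
#!/bin/sh
# with_http_proxy PROXY COMMAND [ARGS...]
# Run COMMAND with http_proxy set to PROXY.
# If PROXY is empty, run COMMAND unchanged (direct download).
with_http_proxy() {
    proxy=$1
    shift
    if [ -n "$proxy" ]; then
        env http_proxy="$proxy" "$@"
    else
        "$@"
    fi
}
```

A call such as `with_http_proxy cdfsquid1.fnal.gov wget http://cdfhead.fnal.gov/condor.tgz` reproduces the slide's command. The slide gives only a host name for the proxy; a full proxy URL with scheme and port is the more usual form, but the port is not stated in the source.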

11 Several GlideCAFs deployed
Sites: CNAF (Bologna, Italy); SDSC (San Diego, CA, USA); Fermilab (Batavia, IL, USA); IN2P3 (Lyon, France); MIT (Boston, MA, USA).
Figure: CNAF six-month history plot. Note over a thousand VMs used for a long period of time.
Figure: SDSC history plot showing number of VMs run.

12 Several GlideCAFs deployed
Sites: CNAF (Bologna, Italy); SDSC (San Diego, CA, USA); Fermilab (Batavia, IL, USA); IN2P3 (Lyon, France); MIT (Boston, MA, USA).
Figure: CNAF six-month history plot. Note over a thousand VMs used for a long period of time.
Figure: SDSC history plot showing number of VMs run.
CDF head node near the worker nodes

13 Gliding to multiple sites a problem
● UDP packets often lost over WAN
● Firewalls and NATs make it impossible
Diagram labels: Collector, Schedd, Globus, Available Batch Slot, Startd, Firewall, WAN.

14 The Solution? GCB and TCP
● TCP robust over WAN
● With GCB, firewalls are no longer a problem
Diagram labels: Collector, Schedd, GCB, Globus, Available Batch Slot, Startd, Firewall, WAN, TCP.

15 Generic Connection Brokering (GCB)
● Now shipped with current Condor development versions
● Designed to allow connections to pass over firewalls
● Part of Condor: no software to install on Grid site
Steps shown in the diagram:
1) Establish a persistent TCP connection to a GCB
2) Advertise itself and the GCB connection
3) To communicate, schedd sends own address to GCB
4) Schedd address forwarded
5) Establish a TCP connection to the schedd
6) A job can be sent to the startd
GCB must be on a public network
Diagram labels: Firewall, Collector, Schedd, Grid Pool, GCB Broker, Startd, Globus, TCP.

16 Enabling GCB just a matter of configuration
glidein_startup.sh:
    validate_node()
    wget http://cdfhead.fnal.gov/condor.tgz
    sha1sum knownSHA1 condor.tgz
    if [ $? -eq 0 ]; then
        tar -xzf condor.tgz
        local_config()
        ./condor_master -r $retmins -dyn -f
    fi
condor_config:
    DAEMON_LIST = MASTER,STARTD
    NEGOTIATOR_HOST = $(HEAD_NODE)
    COLLECTOR_HOST = $(HEAD_NODE)
    MaxJobRetirementTime=$(SHUTDOWN_GRACEFUL_TIMEOUT)
    SHUTDOWN_GRACEFUL_TIMEOUT=288000
    # How long will it wait in an unclaimed state before exiting
    STARTD_NOCLAIM_SHUTDOWN = 1200
    HEAD_NODE = cdfhead.fnal.gov
    # GCB configuration
    NET_REMAP_ENABLE = true
    NET_REMAP_SERVICE = GCB
    NET_REMAP_INAGENT = cdfgcb1.fnal.gov
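Because the GCB settings are three extra config lines, a startup script could emit them only when a broker is supplied. The helper below is a sketch under that assumption; the name `gcb_config` and the idea of passing the broker as an argument are hypothetical, while the `NET_REMAP_*` settings are the ones shown on the slide.

```shell
#!/bin/sh
# gcb_config BROKER
# Print the GCB section of the glide-in's Condor configuration.
# Prints nothing when BROKER is empty, so the glide-in runs without GCB.
gcb_config() {
    broker=$1
    if [ -z "$broker" ]; then
        return 0
    fi
    printf '%s\n' \
        '# GCB configuration' \
        'NET_REMAP_ENABLE = true' \
        'NET_REMAP_SERVICE = GCB' \
        "NET_REMAP_INAGENT = $broker"
}
```

For example, `gcb_config cdfgcb1.fnal.gov >> condor_config` would append the section from the slide to an existing config file.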

17 CDF on the Grid
Diagram labels: Collector, Schedd, Glide-in Schedd, HTTPd, HTTP Proxy, GCB, WAN, Globus, Available Batch Slot, Startd.

18 Status of the GCB-based Condor pool
GCB performance is currently being tested:
● Can coordinate up to 400-500 VMs from a single modified GCB node
● Functions well at all sites that allow outbound connectivity
Future plans:
● Latest official version had scaling problems: we are testing a pre-release
● Attempts will be made to scale past several hundred VMs
Figure (placeholder on slide): Scaling test: ~600 VMs running on multiple grid sites

19 Future Development
We would like to use Condor to combine our current clusters into one super-cluster. This means testing to confirm that Condor scales to the appropriate level.
Figure: current sum total resources

20 Summary
Using Condor took CDF to the Grid
● Switching to glide-ins was easy
● Matchmaking done globally under CDF control
● Late binding removes the necessity to choose between sites
● No need for any add-ons at the Grid site
● We could preserve our interface
● We look forward to working with Condor in the future

