1 Condor Usage at Brookhaven National Lab Alexander Withers (talk given by Tony Chan) RHIC Computing Facility Condor Week - March 15, 2005

2 About Brookhaven National Lab ● One of a handful of national laboratories funded and managed by the U.S. government through the DOE. ● Multi-disciplinary lab with 2,700+ employees; Physics is the largest department. ● The Physics Dept. has its own computing division (30+ FTEs) to support physics (HEP) projects. ● RHIC (nuclear) and ATLAS (HEP) are the largest projects currently supported.

3 Computing Facility Resources ● Full-service facility: central/distributed storage, large Linux Farm, robotic tape system for data storage, data backup, etc. ● 6+ PB permanent tape storage capacity. ● 500+ TB central/distributed disk storage capacity. ● 1.4 million SpecInt2000 aggregate computing power in the Linux Farm.

4 History of Condor at Brookhaven ● First looked at Condor in 2003 as a replacement for LSF and in-house batch software. ● Installed 6.4.7 in August 2003. ● Upgraded to 6.6.0 in February 2004. ● Upgraded to 6.6.6 (with 6.7.0 startd binary) in August 2004. ● User base grew from 12 (April 2004) to 50+ (March 2005).

5 The Rise in Condor Usage

6

7 Condor Cluster Usage

8 BNL’s modified Condorview

9 Overview of Computing Resources ● Total of 2750 CPUs (growing to 3400+ in 2005). ● Two central managers with one acting as a backup. ● Three specialized submit machines which handle ~600 simultaneous jobs each on average. ● 131 of the execute nodes can also act as submission nodes. ● One monitoring/Condorview server.

10 Overview of Computing Resources, cont. ● Six Globus gateway machines for remote job submission. ● Most machines run SL 3.0.2 on the x86 platform; some still run RH 7.3. ● Running 6.6.6 with the 6.7.0 startd binary to take advantage of the multiple-VM feature.

11 Overview of Configuration ● Computing resources are divided into six pools. ● Two configuration models: – Split a pool's resources into two parts and restrict which jobs can run in each part. – A more complex version of the Bologna Batch System. – A pool uses one or both of these models. ● Some pools employ user-priority preemption. ● Use the "drop queue" method to fill fast machines first. ● Tools to easily reconfigure nodes. ● All jobs use the vanilla universe (no checkpointing).
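The slide does not show the expressions behind the "drop queue" method; one minimal way to steer vanilla-universe jobs toward fast machines is a Rank expression over a machine benchmark attribute. KFlops is a standard Condor machine ClassAd attribute; the file itself is a hypothetical sketch, not BNL's actual configuration.

```
# Hypothetical submit-file sketch: Rank steers matchmaking toward
# higher-KFlops (faster) machines when several machines match the job.
universe   = vanilla
executable = analysis.sh        # placeholder job script
rank       = KFlops             # standard machine benchmark attribute
queue
```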

12 Two Part Model ● Nodes are assigned one of two tasks irrespective of Condor: analysis or reconstruction. ● Within Condor, a node advertises itself as either an analysis node or a reconstruction node. ● A job must advertise itself in the same manner to match with an appropriate node. ● Only certain users may run reconstruction jobs but anyone can run an analysis job.
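The talk does not give the actual expressions, but the matching described above can be sketched with a startd attribute and a paired submit-file requirement. NodeType and JobType are hypothetical attribute names invented for this sketch.

```
# startd config (hypothetical attribute names): the node advertises its
# role and only starts jobs that declare the same role.
NodeType     = "reconstruction"
STARTD_EXPRS = $(STARTD_EXPRS), NodeType
START        = (TARGET.JobType =?= "reconstruction")

# submit file: the job advertises its role and requires a matching node.
#   +JobType     = "reconstruction"
#   requirements = (NodeType =?= "reconstruction")
```

Restricting who may submit reconstruction jobs would then be enforced outside this expression (e.g., on the specialized submit machines).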

13 Analysis/Reconstruction (diagram) ● Nodes are ranked from fast (Group 1) to slow (Group 5), each node advertising vm1/vm2. ● No suspension, no preemption; a job starts whenever a CPU is free. ● Example: a reconstruction job that wants group <= 2 matches only the fastest nodes.

14 A More Complex Version of the Bologna Model ● Two CPU nodes each with 8 VMs. ● 2 VMs per CPU. ● Only two jobs running at a time. ● Four job categories, each with its own priority. ● A high priority VM will suspend a random VM of lower priority. ● The random aspect is to prevent the same VM from always getting suspended.
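A heavily hedged config sketch of the scheme above, using the 6.7-era multiple-VM feature the talk mentions. The cross-VM state test and the VM numbering are illustrative only; the real BNL expressions are not shown in the talk.

```
# Advertise 8 virtual machines per node (6.7-era knob).
NUM_VIRTUAL_MACHINES = 8

# Hypothetical sketch: suspend a low-priority VM's job while a
# higher-priority VM on the same node is claimed. Condor of this era
# could publish one VM's attributes to its peers as vmN_<Attr>, which
# a SUSPEND expression can then test (names here are illustrative).
SUSPEND = (vm7_State =?= "Claimed") || (vm8_State =?= "Claimed")
```

Randomizing which lower-priority VM gets suspended, as the slide describes, would need an additional random term in the policy expression.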

15 Analysis/Reconstruction (diagram) ● Nodes are ranked from fast (Group 1) to slow (Group 5). ● VM priority classes per node: MC (vm1/vm2), Low (vm3/vm4), Medium (vm5/vm6), High (vm7/vm8). ● Low-priority VMs are suspended; no preemption; a job starts if a CPU is free or the job has higher priority. ● Example: a reconstruction job that wants group == 3.

16 Issues We've Had to Deal With ● Tuned parameters to alleviate scalability problems: – MATCH_TIMEOUT – MAX_CLAIM_ALIVES_MISSED ● Panasas (a proprietary file system) creates kernel threads with whitespace in the process name, which broke an fscanf() in procapi.C → Panasas fixed the bug. ● High-volume users can dominate a pool; partially solved with PREEMPTION_REQUIREMENTS.
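For illustration, the tuned knobs might appear in the config like this. The numeric values are placeholders, not BNL's actual settings; the PREEMPTION_REQUIREMENTS expression follows the classic fair-share pattern from the Condor manual.

```
# startd-side timeouts, raised to ride out matchmaking/keep-alive load
# (values are illustrative placeholders).
MATCH_TIMEOUT           = 300
MAX_CLAIM_ALIVES_MISSED = 10

# Negotiator-side: only let one user preempt another's running job when
# the preempting user's effective priority is substantially better.
PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio * 1.2
```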

17 Issues We’ve Had to Deal With, cont. ● DAGMan problems (latency, termination) → switched from DAGMan to plain Condor. ● Created our own ClassAds and JobAds to build batch queues and handy management tools (e.g., our own version of condor_off). ● Modified Condorview to meet our accounting and monitoring requirements.

18 Issues Not Yet Resolved ● Need a job ClassAd that gives the user's primary group → better control over cluster usage. ● Transfer output files for debugging when a job is evicted. ● Need an option to force the schedd to release its claim after each job. ● Allow the schedd to set a mandatory periodic_remove policy → avoid manual cleanup.

19 Issues Not Yet Resolved, cont. ● The shadow seems to make a large number of NIS calls; possibly a caching problem → address shadows in the vanilla universe? ● Need Kerberos support to comply with security mandates. ● Interested in Condor on Demand (COD), but missing functionality prevents wider use. ● Need more (and more effective) cluster management tools → does condor_off suffice?

20 Near-Term Plans & Summary ● Waiting for the 6.8.x series (late 2005?) before upgrading. ● Scalability concerns as usage rises. ● High availability becomes more critical as usage rises. ● Integration of BNL Condor pools with external pools, with security concerns. ● Need some of the functionality listed above for a meaningful upgrade and to improve cluster management capability.

