Parallel Session B2 - CPU and Resource Allocation
- Panelists: Charles Young (BaBar), David Bigagli
- Seed questions:
  - Batch queuing system in use?
  - Turnaround guarantees?
  - Pre-allocation of resources?

Batch System
- Vendor or home-brewed? Maintenance and support issues.
- If vendor, licensing and cost issues.
  - Already a significant fraction of H/W cost.
  - Per node? Per unit of computing power? Or?
- Management concerns.
  - Can one really manage 10K nodes?
  - Split into separate management domains?
(Slide from Charles Young, BaBar)

LSF - Talk given by David Bigagli (Platform Computing)
- What is LSF
- Developer's view of LSF architecture
- Scalability
- Dealing with resources
- Load Information Manager
- Batch
- Lively discussion

Discussion: which batch systems are you using and why?
- LSF (~30% ?)
  - LSF has worked for us and continues to work.
- PBS (~30%)
  - PBS is free.
  - Collaborators want us to use PBS (because PBS is free).
  - Ability to modify source.
  - Nobody is using ProPBS.
- Condor (~20%)
  - Condor costs nothing.
  - Cycle stealing allows us to get computing done.

Discussion: which batch systems, continued...
- BQS
  - Homegrown at IN2P3; used in a small number of external sites.
  - Have had it for seven years.
  - Everyone likes it.
- FBS
  - Homegrown at FNAL; used in some external sites.
  - LSF is expensive.
  - FBS is designed to be used on farms.
  - Lightweight and flexible.

Other issues
- Mosix
  - Only CERN has looked at it.
  - It appears to be difficult to take down individual machines in the cluster.
- How to deal with abusers?
  - Turn them over to the user community.
- How do people schedule downtime?
  - Train people that jobs longer than 24 hours are at risk.
  - CERN posts a future shutdown time for the job starter (internal).
  - BQS has this feature built in.
  - Condor has eventd for draining.
  - Labs reboot and have maintenance windows.
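
The downtime approach described above (post a future shutdown time and let the job starter stop launching work that would run past it) can be sketched roughly as follows. This is only an illustration, not the CERN, BQS, or Condor implementation: the function names `may_start` and `at_risk`, the idea that each job declares a requested run time, and the example dates are all assumptions made for the sketch.

```python
from datetime import datetime, timedelta
from typing import Optional


def may_start(now: datetime,
              requested_runtime: timedelta,
              shutdown_at: Optional[datetime]) -> bool:
    """Hypothetical job-starter check: start a job only if it is
    expected to finish before the posted shutdown time."""
    if shutdown_at is None:
        return True  # no shutdown posted: start jobs as usual
    return now + requested_runtime <= shutdown_at


def at_risk(requested_runtime: timedelta,
            limit: timedelta = timedelta(hours=24)) -> bool:
    """Flag jobs longer than the advertised limit
    (24 hours in the discussion above)."""
    return requested_runtime > limit


# Illustrative values only: shutdown posted 12 hours from "now".
now = datetime(2002, 5, 19, 20, 0)
shutdown = datetime(2002, 5, 20, 8, 0)

short_ok = may_start(now, timedelta(hours=6), shutdown)    # fits before shutdown
long_ok = may_start(now, timedelta(hours=18), shutdown)    # would cross shutdown
risky = at_risk(timedelta(hours=30))                       # longer than 24 hours
```

With a check like this the farm drains on its own: short jobs keep starting until the window closes, while long jobs wait for the system to come back.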