Clusters, SubClusters and Queues: A Spotter's Guide
Chris Brew, HepSysMan, 06/11/2008
Slide 2 Current Default Setup
YAIM sets up by default:
–One Cluster (Batch System)
–One SubCluster (set of WNs)
–Multiple CEs (queues) pointing to the SubCluster
Falls down with:
–Non-identical worker nodes
–Multiple CENodes attached to the same batch system
Slide 3 The way it's supposed to be
[Diagram: CE Node; Cluster; a SubClus for each of Type 1 WN, Type 2 WN and Type 3 WN; queues Q1 to Q5; Tags]
Slide 4 The way it usually is
[Diagram: CE Node; Cluster; a single SubClus covering Type 1 WN, Type 2 WN and Type 3 WN; queues Q1 to Q5; Tags]
Slide 5 How bad it can be
[Diagram: two CE Nodes, each with its own Cluster, SubClus and queues Q1 to Q4, over the same Type 1 WN, Type 2 WN and Type 3 WN; Tags]
Slide 6 Problem of Non-Identical Worker Nodes
Default setup assumes that all worker nodes are identical
–Obviously not the case at most sites
–SubCluster has to publish the lowest spec WN
Leads to:
–Small memory jobs wasting large memory nodes
–Inability to publish existence of large memory nodes
–Differing CPU specs lead to inaccurate timing and accounting (CPU scaling helps here)
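The lowest-spec constraint can be illustrated with a minimal sketch. The worker node types, node counts and RAM figures below are hypothetical, not taken from any site; the point is only that a single SubCluster carries one set of host attributes, so it can safely advertise no more than the minimum across all node types.

```python
# Hypothetical worker-node types; the numbers are illustrative only.
worker_node_types = {
    "type1": {"ram_mb": 500, "nodes": 40},
    "type2": {"ram_mb": 1000, "nodes": 30},
    "type3": {"ram_mb": 2000, "nodes": 10},
}


def published_ram_mb(wn_types):
    """RAM a single SubCluster can safely advertise: the minimum over all types."""
    return min(t["ram_mb"] for t in wn_types.values())


def nodes_underdescribed(wn_types):
    """Count nodes that have more RAM than the single SubCluster advertises."""
    floor = published_ram_mb(wn_types)
    return sum(t["nodes"] for t in wn_types.values() if t["ram_mb"] > floor)


print(published_ram_mb(worker_node_types))      # 500
print(nodes_underdescribed(worker_node_types))  # 40
```

With these made-up figures, half of the nodes have more memory than the site is able to publish.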
Slide 7 Problem of Multiple CENodes
Sites want to add multiple CENodes for scaling and redundancy
–Should just add CEs (queue endpoints)
–Currently duplicates Clusters and SubClusters
Causes problems in CPU counting (gStat, GridMap, Accounting Reports, etc.)
Various hacks to try to help with this
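A rough sketch of why the duplication inflates CPU counts follows. The host names, the CPU figure and the `batch_system` key are hypothetical; in particular, the duplicated entries a site actually publishes do not necessarily carry a shared identifier like this, which is part of why monitoring tools struggle to correct for it.

```python
# Hypothetical published records: (ce_node, batch_system, total_cpus).
# Two CENodes front the same batch system, so each repeats the CPU total.
published = [
    ("ce01.example.org", "batch.example.org", 400),
    ("ce02.example.org", "batch.example.org", 400),
]


def naive_cpu_count(records):
    """Sum every published entry, as a tool unaware of the duplication would."""
    return sum(cpus for _, _, cpus in records)


def deduplicated_cpu_count(records):
    """Count each batch system once (assumes a shared identifier is available)."""
    by_batch_system = {}
    for _, batch_system, cpus in records:
        by_batch_system[batch_system] = cpus
    return sum(by_batch_system.values())


print(naive_cpu_count(published))         # 800
print(deduplicated_cpu_count(published))  # 400
```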
Slide 8 Current Hacks
Can already set up multiple Clusters and SubClusters to advertise different memory queues
–See publishing for RAL-LCG2 and UKI-SOUTHGRID-RALPP
–Involves hand-crafted ldif files to set up (Sub)Clusters and map queues to them
–Cannot let YAIM near them
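To give a feel for what a hand-crafted entry might involve, here is a small sketch that renders a partial SubCluster entry as LDIF text. It is not the ldif used at either site. The host and SubCluster names are hypothetical, the `mds-vo-name=resource,o=grid` suffix is an assumption about where the entry sits in the tree, and only a handful of GLUE 1.x attributes are shown; a real entry carries many more.

```python
def subcluster_ldif(cluster_id, subcluster_id, ram_mb):
    """Render a partial GLUE 1.x SubCluster entry as LDIF text (illustrative only)."""
    dn = (
        f"GlueSubClusterUniqueID={subcluster_id},"
        f"GlueClusterUniqueID={cluster_id},"
        "mds-vo-name=resource,o=grid"  # assumed DN suffix
    )
    lines = [
        f"dn: {dn}",
        "objectClass: GlueSubCluster",
        f"GlueSubClusterUniqueID: {subcluster_id}",
        f"GlueHostMainMemoryRAMSize: {ram_mb}",
        # Ties the SubCluster back to its parent Cluster.
        f"GlueChunkKey: GlueClusterUniqueID={cluster_id}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical names: one SubCluster per advertised memory size.
for ram in (500, 1000, 2000):
    print(subcluster_ldif("ce01.example.org", f"ce01.example.org-{ram}", ram))
```

Generating the entries from one function keeps the per-memory-size copies consistent, which is easy to get wrong when editing ldif by hand.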
Slide 9 Traylen Proposal
Move (Sub)Cluster publishing from the CENode to a new node type
–Probably share a node with the site-bdii
CENode GIP will associate queues to SubClusters
Software Tags are currently associated with the CENode, not the (Sub)Cluster; they'll be fixed and published through the new node type.
Slide 10 How it may be
[Diagram: a Glite-Cluster Node carrying the Cluster, three SubClus entries and Tags for Type 1 WN, Type 2 WN and Type 3 WN; three CE Nodes, each with queues Q1 to Q5]
Slide 11 Our Experience
We've put in hand-crafted ldif files to define 500MB, 1000MB and 2000MB SubClusters
grid[500|1000|2000] queues pointing at them on both CENodes
Technically it works – jobs with higher memory requirements only match the high memory queues
In practice it makes no difference – almost no jobs include memory requirements
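The matching behaviour described here can be sketched in a few lines. The queue names follow the grid[500|1000|2000] pattern from the slide; the matching function itself is a simplified stand-in for real grid matchmaking, not how any workload management system implements it.

```python
# Queues from the slide, each pointing at a SubCluster advertising RAM in MB.
queues = {"grid500": 500, "grid1000": 1000, "grid2000": 2000}


def matching_queues(required_ram_mb=None):
    """Return queues whose SubCluster advertises at least the requested RAM.

    A job that states no memory requirement matches every queue.
    """
    ordered = sorted(queues, key=queues.get)  # smallest memory first
    if required_ram_mb is None:
        return ordered
    return [q for q in ordered if queues[q] >= required_ram_mb]


print(matching_queues(1500))  # ['grid2000']
print(matching_queues(1000))  # ['grid1000', 'grid2000']
print(matching_queues())      # ['grid500', 'grid1000', 'grid2000']
```

The last call is the common case in practice: with no memory requirement in the job, every queue matches and the extra SubClusters change nothing.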
Slide 12 Conclusion
You're probably not doing it right at the moment
–But the fix is probably worse
You can add hacks to provide more info to the batch system
–But it probably won't make any difference
Things are likely to change (for the better) in the near future
–Wait until then