Presentation on theme: "Managing a farm without user jobs would be easier" (Clusters and Users at CERN), Tim Smith, CERN/IT. Presentation transcript:

1 “Managing a farm without user jobs would be easier” Clusters and Users at CERN Tim Smith CERN/IT

2 2002/10/25 HEPiX fall 2002: Contents
 The road to shared clusters
 Batch cluster
  Configuration
  User challenges
  Addressing the challenges
 Interactive cluster
  Load balancing
 Conclusions

3 The Demise of Free Choice

4 Cluster Aggregation

5 Organisational Compromises
 Clusters per group
  Sized for the average: users lose out at peaks
  Sized for user peaks: financiers see wasted resources
  Invest effort in recuperating cycles for other groups
  Configuration differences / specialities
 Bulk production clusters
  Production fluctuations dwarf those in user analysis
  Complex cross-submission links

6 Production Farm: Planning

7 Shared Clusters: architecture diagram of interactive nodes (lxplus001, ...) and batch nodes (lxbatch001, ...) behind DNS load balancing and LSF, with rfio access to disk servers (disk001, ...) and tape servers (tape001, ...); batch servers, 70 interactive servers, 120 disk servers.

8 Simple, Uniform Shared Cluster?

9
 Partitioning
  Still have identified resources
  Uniform configuration
 Sharing
  Repartitioning or soak-up queues
  If the owner experiment reclaims resources, soak-up jobs must be suspended: stranded jobs
 [Partition chart: ALICE, ATLAS, CMS, LHCb, ALEPH, DELPHI, L3, OPAL, COMPASS, Ntof, OPERA, SLAP, PARC, PARC Int, CVS, BUILD, DELPHI Int, CSF, Public]

10 LSF Fair-Share
 Trade in partitions for shares
 Multilevel
  ATLAS 10%, CMS 12%, ...
  cmsprod 45%, HiggsWG 15%, ...
  usera 10%, userb 80%, userc 10%
 Extra shares for productions
 Effort shifts from juggling resources to:
  Accounting
  Demonstrating fairness
  Protecting
  Policing
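The multilevel shares above compose multiplicatively: a node's slice of the whole cluster is the product of the shares along its path in the tree. A minimal sketch, using the percentages quoted on the slide but with an assumed tree layout (the real LSF fairshare configuration is not shown in the talk):

```python
# Hypothetical multilevel fair-share tree; each entry is a fraction
# of its parent's allocation, using the slide's example percentages.
shares = {
    "CMS": 0.12,                # experiment share of the cluster
    "CMS/cmsprod": 0.45,        # production group share within CMS
    "CMS/HiggsWG": 0.15,
    "CMS/HiggsWG/userb": 0.80,  # user share within the working group
}

def effective_share(path):
    """Multiply the shares along the path from the root to get the
    fraction of the whole cluster this node is entitled to."""
    frac = 1.0
    parts = path.split("/")
    for i in range(len(parts)):
        frac *= shares["/".join(parts[: i + 1])]
    return frac

print(round(effective_share("CMS/cmsprod"), 4))        # 0.054
print(round(effective_share("CMS/HiggsWG/userb"), 4))  # 0.0144
```

So a production account with 45% of a 12% experiment share is entitled to about 5.4% of the cluster overall.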

11 Facts and Figures
 Accounting
  LSF job records
  Processed with a C program
  Loaded into an Oracle DB
  Plots/tables prepared with the Crystal Reports package
  LSFAnalyser?
 Monitoring
  Poll the user access tools
  SiteAssure?
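The accounting chain above (job records in, per-group usage out) can be sketched without the C program and the Oracle DB. The record fields here are invented stand-ins for illustration; real LSF accounting records (lsb.acct) carry far more fields:

```python
import csv
import io
from collections import defaultdict

# Invented miniature stand-in for LSF job accounting records.
records = """user,group,cpu_seconds
usera,ATLAS,3600
userb,CMS,7200
userc,CMS,1800
"""

def cpu_per_group(text):
    """Aggregate CPU time per group: the core of the accounting report
    used to demonstrate fair-share is being honoured."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(text)):
        totals[row["group"]] += int(row["cpu_seconds"])
    return dict(totals)

print(cpu_per_group(records))  # {'ATLAS': 3600, 'CMS': 9000}
```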

12 CPU Time / Week: plot of the merged user analysis and production farms.

13 Performance: batch job slot analysis plot (Thu/Fri/Sat, 10 min per tick).

14 Challenging Batch (I)
 Probing boundaries
  Flooding
  Concurrent starts
  Uncontrolled status polling
 Hitting limits
  Disk space: /tmp, /pool, /var
  Memory, swap full
  Guarantees for other user jobs?
 System issues
  Queue drainers

15 Challenging Batch (II)
 Un-fair-share
  Logging onto batch machines
  Batch jobs which resubmit themselves
  Forking sessions back to remote hosts
 Wasting resources
  Spawning processes which outlive the jobs
  Sleeping processes
  Copying large AFS trees
  Establishing connections to dead machines

16 Counter Measures
 File system quotas
 Virtual memory limits
 Concurrent job limits per user/group
 Restricted access through PAM
 Instant-response queues
 Master node setup
  Dedicated, 1 GB memory
  Failover cluster
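The virtual memory limits listed above are applied per job process. A minimal sketch using POSIX resource limits on Unix; the wrapped command and the 256 MB cap are illustrative values, not the ones used at CERN:

```python
import resource
import subprocess
import sys

def run_with_limits(cmd, mem_bytes, cpu_secs):
    """Run a batch job command under per-process caps on address
    space and CPU time, applied in the child before exec."""
    def set_limits():
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_secs, cpu_secs))
    return subprocess.run(cmd, preexec_fn=set_limits)

# A job that tries to allocate ~1 GB fails under a 256 MB cap,
# so runaway jobs cannot exhaust memory for other users' jobs.
bad = run_with_limits(
    [sys.executable, "-c", "x = bytearray(1024**3)"],
    mem_bytes=256 * 1024 * 1024,
    cpu_secs=60,
)
print(bad.returncode != 0)
```

The limit is inherited across fork/exec, so it also constrains processes the job spawns.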

17 Shared Clusters: the same architecture diagram, with the batch side linked via LSF MultiCluster.

18 Shared Clusters: the same architecture diagram, operated as a single LSF cluster.

19 Interactive Cluster
 DNS load balancing (ISS)
 Weighted load indexes
  load, memory
  swap rate, disk IO rate
  # processes, # sessions, # window manager sessions
 Exclusion thresholds
  file systems full, nologins
 DNS publishes 2 nodes every 30 seconds
  chosen at random from the lowest 5
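The selection step above can be sketched directly: drop excluded nodes, rank the rest by a weighted load index, and publish two picked at random from the five least loaded. The metric names and weights here are invented; the real ISS index combined load, memory, swap rate, disk IO, and session counts:

```python
import random

# Invented weights for the load index (the real ISS weights differ).
WEIGHTS = {"load": 2.0, "mem_used": 1.0, "sessions": 0.5}

def publish(nodes, k=2, pool_size=5):
    """Pick k node names to publish in DNS: exclude nodes with full
    file systems or nologin set, rank the rest by weighted load
    index, then choose k at random from the pool_size least loaded."""
    def index(metrics):
        return sum(WEIGHTS[key] * metrics[key] for key in WEIGHTS)
    eligible = {name: m for name, m in nodes.items()
                if not m["nologin"] and not m["fs_full"]}
    ranked = sorted(eligible, key=lambda name: index(eligible[name]))
    pool = ranked[:pool_size]
    return random.sample(pool, min(k, len(pool)))

# Eight toy nodes with increasing load; lxplus001 is in nologin.
nodes = {f"lxplus{i:03d}": {"load": i * 0.3, "mem_used": 0.5,
                            "sessions": 10, "nologin": False,
                            "fs_full": False}
         for i in range(1, 9)}
nodes["lxplus001"]["nologin"] = True
print(publish(nodes))
```

Randomising within the lowest five avoids stampeding every new login onto the single least-loaded node between DNS updates.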

20 Daily Users: plot of daily interactive users; around 35 users per node.

21 Challenging Interactive
 Sidestepping load balancing
  Parallel sessions across the farm
  Running daemons
 Brutal logouts
  Open connections
  Defunct processes
 CPU-sapping orphaned processes
  Monitoring + renicing + monthly reboots

22 Interactive Reboots

23 Conclusions
 Shared clusters present more user opportunities
  Both good and bad!
 They are not a panacea for sysadmins!

