
Slide 1: “Managing a farm without user jobs would be easier”
Clusters and Users at CERN
Tim Smith, CERN/IT
HEPiX Fall 2002, 2002/10/25

Slide 2: Contents
- The road to shared clusters
- Batch cluster
  - Configuration
  - User challenges
  - Addressing the challenges
- Interactive cluster
  - Load balancing
- Conclusions

Slide 3: The Demise of Free Choice
[Timeline figure: 2000, 2001, 2002, 2003]

Slide 4: Cluster Aggregation
[Diagram]

Slide 5: Organisational Compromises
- Clusters per group
  - Sized for the average: unhappy users
  - Sized for user peaks: happy users, unhappy financiers (wasted resources)
  - Invest effort in recuperating idle cycles for other groups
  - Configuration differences / specialities
- Bulk production clusters
  - Production fluctuations dwarf those in user analysis
  - Complex cross-submission links

Slide 6: Production Farm: Planning
[Chart]

Slide 7: Shared Clusters
[Architecture diagram: lxplus001… (70 interactive servers, DNS load balancing) and lxbatch001… (750 batch servers, LSF) accessing disk001… (120 disk servers) and tape001… via rfio]

Slide 8: Simple, Uniform Shared Cluster?

Slide 9:
- Partitioning
  - Still have identified resources
  - Uniform configuration
- Sharing
  - Repartitioning or soak-up queues (sketched below)
  - If the owner experiment reclaims resources, soak-up jobs must be suspended: stranded jobs
[Figure: per-group partitions: ALICE, ATLAS, CMS, LHCb, ALEPH, DELPHI, L3, OPAL, COMPASS, Ntof, OPERA, SLAP, PARC, PARC Int, CVS, BUILD, DELPHI Int, CSF, Public]
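The slide names soak-up queues without showing their configuration. Below is a minimal sketch of how such a pair of queues might look in LSF's lsb.queues; the queue names, host group, and priorities are invented, the exact syntax varies across LSF versions, and, as the slide notes, suspending soak-up jobs on reclaim still leaves them stranded.

    # lsb.queues (sketch, names invented)
    Begin Queue
    QUEUE_NAME  = atlas_prod
    PRIORITY    = 40
    PREEMPTION  = PREEMPTIVE[soakup]   # may suspend soak-up jobs on its hosts
    HOSTS       = atlas_hosts
    End Queue

    Begin Queue
    QUEUE_NAME  = soakup
    PRIORITY    = 10
    PREEMPTION  = PREEMPTABLE          # absorbs idle cycles, suspended on reclaim
    HOSTS       = all
    End Queue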

Slide 10: LSF Fair-Share
- Trade in the partition for a share
- Multilevel shares (configuration sketched below)
  - ATLAS 10%, CMS 12%, …
  - cmsprod 45%, HiggsWG 15%, …
  - usera 10%, userb 80%, userc 10%
- Extra shares for productions
- Effort shifts from juggling resources to accounting
  - Demonstrating fairness
  - Protecting
  - Policing
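The slide gives only the share numbers. The sketch below shows roughly how such a multilevel fair-share can be expressed in LSF configuration files; group and user names are taken from the slide where possible and invented otherwise, and the syntax shown is that of later LSF releases, not necessarily what ran at CERN in 2002.

    # lsb.queues -- first-level shares between experiments (sketch)
    Begin Queue
    QUEUE_NAME = shared
    FAIRSHARE  = USER_SHARES[[atlas, 10] [cms, 12] [default, 1]]
    End Queue

    # lsb.users -- second- and third-level shares inside an experiment (sketch)
    Begin UserGroup
    GROUP_NAME   GROUP_MEMBER                  USER_SHARES
    cms          (cmsprod higgswg cmsusers)    ([cmsprod, 45] [higgswg, 15] [default, 10])
    cmsusers     (usera userb userc)           ([usera, 10] [userb, 80] [userc, 10])
    End UserGroup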

Slide 11: Facts and Figures
- Accounting (sketched below)
  - LSF job records
  - Processed with a C program
  - Loaded into an Oracle DB
  - Plots/tables prepared with the Crystal Reports package
  - LSFAnalyser?
- Monitoring
  - Poll the user access tools
  - SiteAssure?
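The slide names the pipeline (LSF job records, a C program, an Oracle DB, Crystal Reports) but shows no code. A minimal illustrative sketch of the middle step follows, assuming a simplified one-record-per-line input of "user group cpu_seconds"; the real LSF accounting records (lsb.acct) carry many more fields, and the table and column names here are invented.

    #include <stdio.h>
    #include <string.h>

    /* Sketch: aggregate CPU time per group from simplified accounting
     * records and emit SQL INSERTs for loading into Oracle. */

    #define MAXGROUPS 256

    struct acct { char group[64]; double cpu; };

    int main(int argc, char **argv)
    {
        static struct acct tab[MAXGROUPS];
        int ngroups = 0;
        char user[64], group[64];
        double cpu;
        FILE *f = argc > 1 ? fopen(argv[1], "r") : stdin;
        if (!f) { perror("fopen"); return 1; }

        while (fscanf(f, "%63s %63s %lf", user, group, &cpu) == 3) {
            int i;
            for (i = 0; i < ngroups; i++)          /* find the group... */
                if (strcmp(tab[i].group, group) == 0)
                    break;
            if (i == ngroups && ngroups < MAXGROUPS)
                strcpy(tab[ngroups++].group, group); /* ...or add it */
            if (i < MAXGROUPS)
                tab[i].cpu += cpu;
        }

        for (int i = 0; i < ngroups; i++)
            printf("INSERT INTO job_acct (grp, cpu_sec) VALUES ('%s', %.0f);\n",
                   tab[i].group, tab[i].cpu);
        return 0;
    }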

Slide 12: CPU Time / Week
[Chart: weekly CPU time on the merged user analysis and production farms]

Slide 13: Performance of Batch
[Chart: job slot analysis, Thursday to Saturday, 10 minutes per tick]

Slide 14: Challenging Batch (I)
- Probing boundaries
  - Flooding
  - Concurrent starts
  - Uncontrolled status polling
- Hitting limits
  - Disk space: /tmp, /pool, /var
  - Memory, swap full
  - Guarantees for other user jobs?
- System issues
  - Queue drainers

Slide 15: Challenging Batch (II)
- Un-fair-share
  - Logging onto batch machines
  - Batch jobs which resubmit themselves
  - Forking sessions back to remote hosts
- Wasting resources
  - Spawning processes which outlive the jobs
  - Sleeping processes
  - Copying large AFS trees
  - Establishing connections to dead machines

Slide 16: Counter Measures
- File system quotas
- Virtual memory limits
- Concurrent job limits per user/group
- Restricted access through PAM (see the sketch below)
- Instant response queues
- Master node setup
  - Dedicated, 1 GB memory
  - Failover cluster
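The slide lists "restricted access through PAM" without detail. One common way to implement it, shown here as an assumption rather than CERN's actual setup, is pam_listfile, which rejects interactive logins from anyone not on an allow list; the service file and list path are invented.

    # /etc/pam.d/sshd (illustrative) -- only accounts listed in
    # /etc/batch.allow may log in to a batch node interactively
    auth    required    pam_listfile.so item=user sense=allow file=/etc/batch.allow onerr=fail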

Slide 17: Shared Clusters
[Same architecture diagram as slide 7, with the interactive and batch clusters linked by LSF MultiCluster]

Slide 18: Shared Clusters
[Same architecture diagram as slide 7, with the interactive and batch nodes operated as a single LSF cluster]

Slide 19: Interactive Cluster
- DNS load balancing (ISS)
- Weighted load indexes
  - load, memory
  - swap rate, disk I/O rate
  - number of processes, sessions, and window manager sessions
- Exclusion thresholds
  - file systems full, nologins
- DNS publishes 2 nodes every 30 seconds, chosen at random from the 5 least loaded (sketched in C below)
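The slide describes the ISS selection rule but not its implementation. The C sketch below reproduces the rule on made-up data: score each node by a weighted load index, drop excluded nodes, then publish 2 picked at random from the 5 least loaded. The weights, metrics, and node values are assumptions, not the actual ISS code.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    struct node {
        const char *name;
        double load, mem_used, swap_rate, io_rate;
        int nproc, excluded;   /* excluded: file system full, nologin, ... */
        double index;
    };

    static int by_index(const void *a, const void *b)
    {
        const struct node *x = a, *y = b;
        return (x->index > y->index) - (x->index < y->index);
    }

    int main(void)
    {
        struct node nodes[] = {      /* invented sample measurements */
            { "lxplus001", 0.7, 0.5, 0.1, 0.2, 120, 0 },
            { "lxplus002", 2.1, 0.9, 0.6, 0.4, 310, 0 },
            { "lxplus003", 0.3, 0.4, 0.0, 0.1,  80, 0 },
            { "lxplus004", 1.5, 0.6, 0.2, 0.3, 200, 1 }, /* nologin set */
            { "lxplus005", 0.9, 0.7, 0.3, 0.2, 150, 0 },
            { "lxplus006", 0.4, 0.3, 0.1, 0.1,  90, 0 },
        };
        int n = sizeof nodes / sizeof nodes[0], avail = 0;

        /* Weighted load index; the weights are arbitrary here. */
        for (int i = 0; i < n; i++) {
            if (nodes[i].excluded) continue;       /* exclusion thresholds */
            nodes[i].index = 1.0 * nodes[i].load + 0.5 * nodes[i].mem_used
                           + 0.5 * nodes[i].swap_rate + 0.3 * nodes[i].io_rate
                           + 0.01 * nodes[i].nproc;
            nodes[avail++] = nodes[i];             /* compact the survivors */
        }
        qsort(nodes, avail, sizeof nodes[0], by_index);

        /* Publish 2 distinct nodes at random from the 5 least loaded. */
        srand((unsigned)time(NULL));
        int pool = avail < 5 ? avail : 5;
        int a = rand() % pool, b = rand() % pool;
        while (b == a && pool > 1) b = rand() % pool;
        printf("publish: %s %s\n", nodes[a].name, nodes[b].name);
        return 0;
    }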

Slide 20: Daily Users
[Chart: daily interactive users; about 35 users per node]

Slide 21: Challenging Interactive
- Sidestepping load balancing
  - Parallel sessions across the farm
  - Running daemons
- Brutal logouts
  - Open connections
  - Defunct processes
  - CPU-sapping orphaned processes
- Counter measures: monitoring + beniced + monthly reboots

Slide 22: Interactive Reboots
[Chart]

Slide 23: Conclusions
- Shared clusters present more user opportunities
  - Both good and bad!
- They are not a panacea for sysadmins!

