Valencia Cluster status. Gang Qin, Nov. 25, 2011.

1 Valencia Cluster status. Gang Qin, Nov. 25, 2011

2 New Items
- Condor & PROOF monitoring
  - Service Availability Monitoring (SAM): every Condor slave in the cluster receives a test job every hour; the results are merged into the web monitoring page, and an alarm mail is sent if any of them fails.
  - Similar idea for PROOF.
  - SAM jobs have no priority, and they add system load when the system load is already quite high.
- NFS failing on some WNs
  - Some jobs fail directly.
  - A common problem with NFS, usually fixed by crond.
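The merge-and-alarm step of the hourly SAM test could be sketched as below. This is only an illustration of the idea on the slide, not the cluster's actual monitoring code: the function names, the node names and the message text are all hypothetical, and actually sending the mail is left out.

```python
from typing import Dict, List


def failed_nodes(results: Dict[str, bool]) -> List[str]:
    """Return the sorted names of nodes whose hourly test job failed."""
    return sorted(node for node, ok in results.items() if not ok)


def alarm_message(results: Dict[str, bool]) -> str:
    """Build the alarm mail body, or an empty string if every test passed."""
    failed = failed_nodes(results)
    if not failed:
        return ""
    return "SAM test failed on: " + ", ".join(failed)


if __name__ == "__main__":
    # Hypothetical node names and results, for illustration only.
    latest = {"wn01": True, "wn02": False, "wn03": True}
    print(alarm_message(latest))
```

An empty message means no mail needs to be sent for that hour.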

3 Items with improvement
- Condor upgrade on the valtical cluster
  - condor-7.6.4-1.x86_64 has been installed on all machines in the valtical cluster, and the twiki has been updated as well. Users do not need any special environment settings to run condor commands.
  - The configuration files for the condor master & slaves differ; they are to be unified in a future scheduled maintenance.
- Optimization of the crontab that restarts the xrootd & proofd services
  - Deployed to all machines in the valtical cluster.
- High CPU load (>100) on Valtical00 (NFS server)
  - Caused by xrootd; around 50% of the xrootd data is stored on this machine (12TB).
  - Possible solutions
    - Data rebalance between data servers, which means adding more disks to other WNs. This requires changing the chassis; Carlos has ordered one and it arrived today. Further tests will be organized.
    - File size regulation: currently the size of xrootd files in the cluster ranges from ~20M to ~1G. The general idea is that disk I/O will benefit from larger files; tests to be done.
    - Adding a RAID controller at the beginning? (not possible now)
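To make the data rebalance idea concrete, the sketch below computes how much data each server would have to gain or lose to reach an even split. It assumes equal capacity on every server, which the slides do not state. Only the 12TB on Valtical00 comes from the slide; the other server names and sizes are hypothetical.

```python
from typing import Dict


def rebalance_plan(used_tb: Dict[str, float]) -> Dict[str, float]:
    """Return the TB to add (+) or remove (-) per server for an even split."""
    target = sum(used_tb.values()) / len(used_tb)
    return {server: target - used for server, used in used_tb.items()}


if __name__ == "__main__":
    # valtical00 holds about half of the data (12TB, from the slide);
    # server_a and server_b are hypothetical stand-ins for the other WNs.
    usage = {"valtical00": 12.0, "server_a": 6.0, "server_b": 6.0}
    for server, delta in rebalance_plan(usage).items():
        print(f"{server}: {delta:+.1f} TB")
```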

4 Load Balancing
- Balance data importing and PROOF jobs
  - When data is imported to the cluster with xrdcp, PROOF jobs become very slow or sometimes crash.
  - Coordinate the data importing & PROOF job running times?
    - Data importing before 9:00 and after 20:00?
    - Send mail to the mailing list when data importing starts and ends?
- Load balance between Condor & PROOF in the cluster
  - Prevent the condor daemon on a client from starting when the non-condor CPU load is > 0.3 (further tests needed)
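The two proposals on this slide amount to simple checks, sketched below under stated assumptions. The function names are hypothetical, 20:00 itself is treated as already allowed (the slide only says "after 20:00"), and the non-condor load is taken to be the total load average minus the load caused by condor jobs.

```python
from datetime import time


def import_allowed(now: time,
                   busy_start: time = time(9, 0),
                   busy_end: time = time(20, 0)) -> bool:
    """True outside the busy window, i.e. before 9:00 or from 20:00 on."""
    return now < busy_start or now >= busy_end


def condor_may_start(total_load: float, condor_load: float,
                     threshold: float = 0.3) -> bool:
    """True when the load not caused by condor is at or below the threshold."""
    non_condor_load = total_load - condor_load
    return non_condor_load <= threshold


if __name__ == "__main__":
    print(import_allowed(time(8, 30)))   # early morning import
    print(condor_may_start(1.5, 1.0))    # non-condor load of 0.5
```

Both thresholds are parameters, so they can be tuned once the further tests mentioned on the slide have been done.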

5 Pending Items
- Evaluate filesystem migration from XRootD to EOS
  - To be done.
- Find the cause of the regular IOwait problems in the NFS share
  - The problem is not in the NFS service, but we can still do some NFS optimization
    - nfsd number adjustment: 8 is fine
    - Linux kernel optimization: no big improvement observed in a quick check; longer tests to be done.
  - Better use of NFS?
    - The disk I/O situation will be even worse when xrootd is accessing files on the NFS server.
    - Separate WN, NFS & UI with limited machines?
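For the longer-running IOwait tests, one possible measurement is the share of CPU time spent in iowait between two samples of the aggregate `cpu` line in `/proc/stat`. The sketch below is an assumption about how such a check could be done, not the method used on the cluster; the function names are hypothetical and the sample lines in the example are made up.

```python
from typing import List


def cpu_times(stat_line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into integer counters."""
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Keep user, nice, system, idle, iowait, irq, softirq, steal only;
    # later guest fields are already counted inside user time.
    return [int(value) for value in fields[1:9]]


def iowait_fraction(before: str, after: str) -> float:
    """Fraction of CPU time spent in iowait between the two samples."""
    deltas = [b - a for a, b in zip(cpu_times(before), cpu_times(after))]
    total = sum(deltas)
    # iowait is the fifth counter on the cpu line.
    return deltas[4] / total if total > 0 else 0.0


if __name__ == "__main__":
    # Made-up samples, for illustration only.
    first = "cpu 100 0 50 800 50 0 0 0"
    second = "cpu 200 0 100 1500 200 0 0 0"
    print(f"iowait: {iowait_fraction(first, second):.1%}")
```

On a live machine the two strings would be the first line of `/proc/stat`, read some seconds apart.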

6 Finished old items
- Revive valtical15 as an SLC5 workstation
  - Done; it now provides NFS service to the whole cluster (/data2, /data3, /data4).

7 Thank you

