CMS Issues
Background – RAL Infrastructure
[Diagram: CASTOR instance layout – headnodes running TM, nsd, xrd-mgr, rhd, stagerd and TGW daemons; a common layer of Cupv, Vmgr, Vdqm and nsd services; and ~20 diskservers serving CASTOR XROOT]
Background – xroot Infrastructure
[Diagram: xroot hierarchy – 20 diskservers behind an xroot manager and a 4.X xroot redirector, reporting up through a European redirector to the global redirectors; clients are the local WNs and the Grid]
The Problem…s
Pileup workflow
–Local jobs had a 95% failure rate
–Jobs that managed to run achieved only 30% efficiency
AAA failures
–Despite RAL being the second site to integrate into AAA
–100% failure for periods of 30 minutes to several days
Tackling the Problems
Pileup Broken Down
Data accessed through xroot; >95% of the data is at RAL
Two problems in one
–Slow open times (15 to 600 seconds)
–Slow transfer rates
–CPU wait-I/O (WIO) at 100%
Slow Opening Times
No obvious culprit
–Delays at all phases
–Almost all DB time spent in SubRequestToDo
Solution 1 (aka The Go Faster Stripes Solution)
Database Surgery
DBMS_ALERT suspected of adding to delays under load
–Modified the DB code to sleep for 50 ms instead (limiting SubRequestToDo polling to 20 per second)
Tested (functionally) on preprod
–Improved open times from 3–15 s to 0–5 s
Deployed on all instances
Made NO difference to the CMS problem
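The throttling idea above can be sketched generically: a fixed sleep between polls bounds the polling rate, so a 50 ms minimum interval caps a SubRequestToDo-style loop at 20 polls per second. This is an illustrative sketch, not the CASTOR PL/SQL; the function and parameter names are invented.

```python
import time

def rate_limited_poll(fetch, min_interval=0.05, polls=5):
    """Call fetch() repeatedly, keeping successive calls at least
    min_interval seconds apart (50 ms -> at most 20 polls/s)."""
    results = []
    last = 0.0
    for _ in range(polls):
        wait = min_interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)  # enforce the minimum spacing
        last = time.monotonic()
        results.append(fetch())
    return results
```

Five polls therefore take at least four 50 ms gaps, i.e. about 0.2 s, regardless of how cheap `fetch` itself is.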
Solution 2 (aka The Heart Bypass Solution)
Bypassing the Scheduler
Modified xroot to disable scheduling
RISK
–Nothing restricting access to the disk servers
–ONLY applied to CMS
RESULT
–Open times reduced to 1–30 seconds
–WIO still flatlining at 100%
‘SUCCESS’
Improving IO
Difficult to test
–Could not generate the load artificially
–Needed the pileup workflow to be executing
Testing on production ;)
Did ‘the usual’
–Reduced allowed connections
–Throttled batch jobs
Solution 3 (aka The Don’t Do This Solution)
Changed the Linux I/O scheduler
–Now easy, and can be done in situ
Four schedulers (plus options)
–cfq (default), anticipatory, deadline, noop
–Plus associated config
Switched to noop
–WIO dropped to 60%
–Network rate increased 4x
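The in-situ switch above is a one-line write to sysfs (`/sys/block/<dev>/queue/scheduler`, where the kernel lists the available schedulers with the active one in brackets). A minimal sketch, with a `sysfs_root` parameter added purely so it can be exercised against a fake tree; run against the real `/sys` it needs root:

```python
from pathlib import Path

def set_io_scheduler(device, scheduler, sysfs_root="/sys"):
    """Select a block-layer I/O scheduler (e.g. 'noop') for a device.

    The kernel exposes the choices in .../queue/scheduler, e.g.
    'noop anticipatory deadline [cfq]' with the active one bracketed.
    """
    path = Path(sysfs_root) / "block" / device / "queue" / "scheduler"
    available = path.read_text().replace("[", "").replace("]", "").split()
    if scheduler not in available:
        raise ValueError(f"{scheduler!r} not offered; kernel has {available}")
    path.write_text(scheduler)  # the kernel applies the change immediately

# On a real system (as root): set_io_scheduler("sda", "noop")
```

Note the change is per-device and does not survive a reboot; making it permanent means a boot parameter (`elevator=noop`) or an init-time script.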
XROOT Problems
Observations
Random failures (or, more correctly, random successes)
Local access was OK (if slow – see previous)
Lack of visibility up the hierarchy didn’t help
–REALLY difficult to debug
Investigating the Problem
Set up a parallel infrastructure
–Replicated the manager, RAL redirector and European redirector
Immediately saw the same issue…
Causes of Failure…
Caching!
–cmsd and xrootd timed out at different times
–xrootd can return ENOENT, but if cmsd later gets a response, subsequent accesses work
–If cmsd doesn’t get a response, all future requests get ENOENT
But why the slow response…?
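The failure mode above can be shown with a toy model: if a slow (timed-out) lookup is cached as "no such file", every later request is answered from that stale negative entry even though the file exists and the backend has recovered. This is only an illustration of the pattern, not the cmsd implementation; the class and its timing model are invented.

```python
import errno

class NegativeCachingLookup:
    """Toy redirector: caches a timed-out lookup as ENOENT and then
    serves that stale negative answer for every later request."""

    def __init__(self, backend, timeout_budget):
        self.backend = backend            # backend(path) -> (delay_s, exists)
        self.timeout_budget = timeout_budget
        self.cache = {}                   # path -> 0 (found) or an errno

    def lookup(self, path):
        if path in self.cache:
            return self.cache[path]       # cached answer, right or wrong
        delay, exists = self.backend(path)
        if delay > self.timeout_budget or not exists:
            self.cache[path] = errno.ENOENT   # timeout cached as "not found"
        else:
            self.cache[path] = 0
        return self.cache[path]
```

One slow first answer is enough to poison the entry: even after the backend becomes fast, the cached ENOENT keeps winning.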
Log Mining…
Each log in isolation looked like performance was good
Part of the problem
–Coarse time resolution in xroot 3.3.X
–And logging generally
Finally found delays in the ‘local’ nsd
–Processing time was good
–But there were delays in servicing requests
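The nsd finding above is the classic "processing fast, servicing slow" gap: per-request processing time looks fine, but the interval between a request arriving and being picked up is where the delay hides, and coarse timestamps flatten it away. A sketch of mining that gap from paired log lines; the log format here (`<timestamp> recv|done req=<id>`) is hypothetical, not xroot's or nsd's actual format:

```python
import re
from datetime import datetime

LINE = re.compile(r"(\S+ \S+) (recv|done) req=(\S+)")

def service_delays(log_lines, fmt="%Y-%m-%d %H:%M:%S.%f"):
    """Pair 'recv'/'done' entries per request id and return the
    elapsed seconds between them (hypothetical log format)."""
    started, delays = {}, {}
    for line in log_lines:
        m = LINE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), fmt)
        if m.group(2) == "recv":
            started[m.group(3)] = ts
        elif m.group(3) in started:
            delays[m.group(3)] = (ts - started.pop(m.group(3))).total_seconds()
    return delays
```

With one-second timestamps a 0.9 s servicing gap rounds to 0 or 1, which is exactly why the coarse resolution in the real logs made these delays so hard to spot.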
Solution – RAL Infrastructure
[Diagram: revised layout – the CASTOR headnodes (TM, nsd, xrd-mgr, rhd, stagerd, TGW) and the 20 diskservers serve CASTOR XROOT to local WNs directly, while a dedicated nsd/xrd-mgr pair behind the 4.X xroot redirector handles the EU and global redirectors, so remote WNs coming via the Grid no longer share the local path]