Presentation on theme: "CMS Issues" — Presentation transcript:

1 CMS Issues

2 Background – RAL Infrastructure
[Diagram: CASTOR headnodes running the TM, nsd, xrootd manager, rhd, stagerd and TGW daemons, with Cupv/Vmgr/Vdqm/nsd services in a common-layer instance, in front of the diskservers (x20)]
CASTOR 2.1.14-15, XROOT 3.3.3-1

3 Background – xroot infrastructure
[Diagram: local WNs and the Grid reach the diskservers (x20) via the xroot manager (3.3.3-1), behind a RAL xroot redirector (4.X), the European redirector and the global redirectors]

4 The Problem…s
Pileup workflow
– Local jobs had a 95% failure rate
– Jobs that managed to run had only 30% efficiency
AAA failure
– Despite RAL being the second site to integrate into AAA
– 100% failure for periods of 30 minutes to several days

5 Tackling the Problems

6 Pileup Broken Down
Data accessed through xroot
– >95% of the data already at RAL
Two problems in one
– Slow opening times (15 → 600 secs)
– Slow transfer rates
– 100% CPU WIO

7 Slow Opening Times
No obvious culprit
– Delays at all phases
– Almost all DB time spent in SubRequestToDo

8 Solution 1 (aka The Go Faster Stripes Solution)

9 Database Surgery
DBMS_ALERT suspected of adding to delays under load
– Modified the DB code to sleep for 50 ms (limiting SubRequestToDo polling to 20 calls/sec)
Tested on preprod (functionally)
– Improved open time from 3–15 secs to 0–5 secs
Deployed on all instances
Made NO difference to the CMS problem ☹
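The 50 ms sleep is simply a rate cap on a hot polling loop. A minimal Python sketch of the idea (the real change was made inside the stager's PL/SQL code; `fetch_batch` is a hypothetical stand-in for the SubRequestToDo call):

```python
import time

POLL_SLEEP = 0.050  # 50 ms sleep => at most ~20 SubRequestToDo polls/sec

def poll_subrequests(fetch_batch, iterations=5):
    """Throttled polling loop: sleep between DB calls so back-to-back
    SubRequestToDo queries cannot saturate the database under load."""
    results = []
    for _ in range(iterations):
        results.extend(fetch_batch())   # one SubRequestToDo call
        time.sleep(POLL_SLEEP)          # rate cap: 1 call per 50 ms
    return results
```

The trade-off is the one the slide reports: each individual open waits up to one sleep interval longer, in exchange for a bounded query rate on the DB.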

10 Solution 2 (aka The Heart Bypass Solution)

11 Bypassing the Scheduler
Modified xroot to disable scheduling
RISK
– Nothing restricting access to the disk servers
– ONLY applied to CMS
RESULT
– Open times reduced to 1–30 seconds
– WIO still flatlining at 100%
‘SUCCESS’

12 Improving IO
Difficult to test
– Could not generate the load artificially
– Needed the pileup workflow to be executing
Testing on production ;)
Did ‘the usual’
– Reducing allowed connections
– Throttling batch jobs
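"Reducing allowed connections" amounts to capping concurrency per diskserver. A generic sketch using a bounded semaphore (the actual limits were xrootd and batch-system settings, not application code; `MAX_CONNECTIONS` is an illustrative value):

```python
import threading

MAX_CONNECTIONS = 8  # illustrative: the real knob was an xrootd/batch setting
slots = threading.BoundedSemaphore(MAX_CONNECTIONS)

def serve_transfer(do_io):
    """Run a transfer only while holding one of the limited connection
    slots, so no more than MAX_CONNECTIONS transfers run at once."""
    with slots:
        return do_io()
```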

13 Solution 3 (aka The Don’t Do This Solution)
Change the UNIX I/O scheduler
– Now easy, and can be done in-situ
Four schedulers (plus options)
– cfq (default), anticipatory, deadline, noop
– Plus associated config
Switched to noop
– WIO dropped to 60%
– Network rate increased 4x
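The in-situ switch is a one-line write to sysfs. A sketch of reading and changing the scheduler (the device name is an assumption, and the write needs root, so it is left commented out):

```python
# Pre-blk-mq kernels expose the I/O scheduler per block device via sysfs.
DEV = "sda"  # assumption: substitute the diskserver's data device
SCHED_PATH = f"/sys/block/{DEV}/queue/scheduler"

def active_scheduler(sysfs_text: str) -> str:
    """The active scheduler is the bracketed entry:
    'noop anticipatory deadline [cfq]' -> 'cfq'."""
    return sysfs_text.split("[")[1].split("]")[0]

# To switch (as root), write the new name back:
# with open(SCHED_PATH, "w") as f:
#     f.write("noop")
```

noop simply hands requests to the device in FIFO order; on a diskserver whose workload is many concurrent streams, dropping cfq's per-process fairness and idling logic is what recovered the throughput reported above.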

14 XROOT Problems

15 Observations
Random failures (or, more correctly, random successes)
Local access was OK (if slow – see previous)
Lack of visibility up the hierarchy didn’t help – REALLY difficult to debug

16 Investigating the Problem
Set up a parallel infrastructure
– Replicated the manager, RAL redirector and European redirector
Immediately saw the same issue…

17 Causes of Failure…
Caching!
– cmsd and xrootd timed out at different times
– xroot can return ENOENT, but if the cmsd later gets a response, subsequent accesses work
– If the cmsd doesn’t get a response, all future requests get ENOENT
But why the slow response…?
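The failure mode above can be sketched as a negative cache with mismatched timeouts: once the redirector caches "not found", only a late positive response clears it (the class, method names and TTL are illustrative, not the actual xroot/cmsd code):

```python
import time

class NegativeCache:
    """Illustrative sketch of a redirector caching ENOENT answers.

    If the backend reply arrives late, record_hit() clears the entry and
    subsequent accesses work; if no reply ever arrives, every lookup
    within the TTL keeps returning 'file not found'."""

    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self.missing = {}  # path -> time the ENOENT was cached

    def record_miss(self, path):
        self.missing[path] = time.monotonic()

    def record_hit(self, path):
        self.missing.pop(path, None)  # late positive response clears it

    def returns_enoent(self, path):
        t = self.missing.get(path)
        return t is not None and time.monotonic() - t < self.ttl
```

This is why the symptom was "random successes": whether a given file worked depended on whether the slow backend answered before or after the cmsd gave up.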

18 Log Mining…
Each log looked like performance was good
Part of the problem
– Time resolution in xroot 3.3.X
– And logging generally
Finally found delays in the ‘local’ nsd
– Processing time was good
– But delays in servicing requests

19 Solution – RAL Infrastructure
[Diagram: updated infrastructure – headnodes (TM, nsd, xrootd manager, rhd, stagerd, TGW) in front of the RAL diskservers (x20), now behind the xroot redirector (4.X), serving local and remote WNs via the EU and global redirectors]
CASTOR 2.1.14-15, XROOT 3.3.6-1

