Presentation on theme: "CMS Issues" — Presentation transcript:

1 CMS Issues

2 Background – RAL Infrastructure
[Diagram: CASTOR headnodes running the TM, nsd, xrootd manager, rhd, stagerd and TGW daemons, with Cupv/Vmgr/Vdqm/nsd services in a common-layer instance, in front of the diskservers (x20)]
CASTOR 2.1.14-15, XROOT 3.3.3-1

3 Background – xroot infrastructure
[Diagram: local WNs and the Grid reach the diskservers (x20) via the xroot manager (3.3.3-1), behind a RAL xroot redirector (4.X), the European redirector and the global redirectors]

4 The Problem…s
Pileup workflow
– Local jobs had a 95% failure rate
– Jobs that managed to run had only 30% efficiency
AAA failure
– Despite RAL being the second site to integrate into AAA
– 100% failure for periods of 30 minutes to several days

5 Tackling the Problems

6 Pileup Broken Down
Data accessed through xroot
– >95% of the data already at RAL
Two problems in one
– Slow opening times (15 → 600 secs)
– Slow transfer rates
– 100% CPU WIO

7 Slow Opening Times
No obvious culprit
– Delays at all phases
– Almost all DB time spent in SubRequestToDo

8 Solution 1 (aka The Go Faster Stripes Solution)

9 Database Surgery
DBMS_ALERT suspected of adding to delays under load
– Modified the DB code to sleep for 50 ms (limiting SubRequestToDo polling to 20 calls/sec)
Tested on preprod (functionally)
– Improved open time from 3–15 secs to 0–5 secs
Deployed on all instances
Made NO difference to the CMS problem ☹
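The 50 ms sleep is simply a rate cap on a hot polling loop. A minimal Python sketch of the idea (the real change was made inside the stager's PL/SQL code; `fetch_batch` is a hypothetical stand-in for the SubRequestToDo call):

```python
import time

POLL_SLEEP = 0.050  # 50 ms sleep => at most ~20 SubRequestToDo polls/sec

def poll_subrequests(fetch_batch, iterations=5):
    """Throttled polling loop: sleep between DB calls so back-to-back
    SubRequestToDo queries cannot saturate the database under load."""
    results = []
    for _ in range(iterations):
        results.extend(fetch_batch())   # one SubRequestToDo call
        time.sleep(POLL_SLEEP)          # rate cap: 1 call per 50 ms
    return results
```

The trade-off is the one the slide reports: each individual open waits up to one sleep interval longer, in exchange for a bounded query rate on the DB.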

10 Solution 2 (aka The Heart Bypass Solution)

11 Bypassing the Scheduler
Modified xroot to disable scheduling
RISK
– Nothing restricting access to the disk servers
– ONLY applied to CMS
RESULT
– Open times reduced to 1–30 seconds
– WIO still flatlining at 100%
‘SUCCESS’

12 Improving IO
Difficult to test
– Could not generate the load artificially
– Needed the pileup workflow to be executing
Testing on production ;)
Did ‘the usual’
– Reducing allowed connections
– Throttling batch jobs
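"Reducing allowed connections" amounts to capping concurrency per diskserver. A generic sketch using a bounded semaphore (the actual limits were xrootd and batch-system settings, not application code; `MAX_CONNECTIONS` is an illustrative value):

```python
import threading

MAX_CONNECTIONS = 8  # illustrative: the real knob was an xrootd/batch setting
slots = threading.BoundedSemaphore(MAX_CONNECTIONS)

def serve_transfer(do_io):
    """Run a transfer only while holding one of the limited connection
    slots, so no more than MAX_CONNECTIONS transfers run at once."""
    with slots:
        return do_io()
```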

13 Solution 3 (aka The Don’t Do This Solution)
Change the UNIX I/O scheduler
– Now easy, and can be done in-situ
Four schedulers (plus options)
– cfq (default), anticipatory, deadline, noop
– Plus associated config
Switched to noop
– WIO dropped to 60%
– Network rate increased 4x
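The in-situ switch is a one-line write to sysfs. A sketch of reading and changing the scheduler (the device name is an assumption, and the write needs root, so it is left commented out):

```python
# Pre-blk-mq kernels expose the I/O scheduler per block device via sysfs.
DEV = "sda"  # assumption: substitute the diskserver's data device
SCHED_PATH = f"/sys/block/{DEV}/queue/scheduler"

def active_scheduler(sysfs_text: str) -> str:
    """The active scheduler is the bracketed entry:
    'noop anticipatory deadline [cfq]' -> 'cfq'."""
    return sysfs_text.split("[")[1].split("]")[0]

# To switch (as root), write the new name back:
# with open(SCHED_PATH, "w") as f:
#     f.write("noop")
```

noop simply hands requests to the device in FIFO order; on a diskserver whose workload is many concurrent streams, dropping cfq's per-process fairness and idling logic is what recovered the throughput reported above.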

14 XROOT Problems

15 Observations
Random failures (or, more correctly, random successes)
Local access was OK (if slow – see previous)
Lack of visibility up the hierarchy didn’t help – REALLY difficult to debug

16 Investigating the Problem
Set up a parallel infrastructure
– Replicated the manager, RAL redirector and European redirector
Immediately saw the same issue…

17 Causes of Failure…
Caching!
– cmsd and xrootd timed out at different times
– xroot can return ENOENT, but if the cmsd later gets a response, subsequent accesses work
– If the cmsd doesn’t get a response, all future requests get ENOENT
But why the slow response…?
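The failure mode above can be sketched as a negative cache with mismatched timeouts: once the redirector caches "not found", only a late positive response clears it (the class, method names and TTL are illustrative, not the actual xroot/cmsd code):

```python
import time

class NegativeCache:
    """Illustrative sketch of a redirector caching ENOENT answers.

    If the backend reply arrives late, record_hit() clears the entry and
    subsequent accesses work; if no reply ever arrives, every lookup
    within the TTL keeps returning 'file not found'."""

    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self.missing = {}  # path -> time the ENOENT was cached

    def record_miss(self, path):
        self.missing[path] = time.monotonic()

    def record_hit(self, path):
        self.missing.pop(path, None)  # late positive response clears it

    def returns_enoent(self, path):
        t = self.missing.get(path)
        return t is not None and time.monotonic() - t < self.ttl
```

This is why the symptom was "random successes": whether a given file worked depended on whether the slow backend answered before or after the cmsd gave up.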

18 Log Mining…
Each log looked like performance was good
Part of the problem
– Time resolution in xroot 3.3.X
– And logging generally
Finally found delays in the ‘local’ nsd
– Processing time was good
– But delays in servicing requests

19 Solution – RAL Infrastructure
[Diagram: updated infrastructure – headnodes (TM, nsd, xrootd manager, rhd, stagerd, TGW) in front of the RAL diskservers (x20), now behind the xroot redirector (4.X), serving local and remote WNs via the EU and global redirectors]
CASTOR 2.1.14-15, XROOT 3.3.6-1

