Glidein Factory Operations


1 Glidein Factory Operations
glideinWMS training: Glidein Factory Operations, i.e. how do we spend our time? By Igor Sfiligoi (UCSD). glideinWMS training, G.Factory Operations

2 G. Factory Operation Categories
Factory node operations; serving VO Frontend Admin requests; keeping up with changes in the Grid; debugging Grid problems.

3 G. Factory Operation Ongoing Costs
Factory node operations: pretty much runs itself; unexpected work is <1 day/month. Serving VO Frontend Admin requests: highly variable, averaging a few hours/week. Keeping up with changes in the Grid: variable, currently O(10 hours)/week. Debugging Grid problems: more than we have effort for! Better tools could drastically reduce this.

4 Factory node ops
The factory mostly just runs. An occasional SW upgrade is needed, but it is typically fast and painless. Most effort goes into investigating unexpected behavior, e.g. high load or weird problems after a reboot/OS upgrade. Of course, installing a new node can take significant time, but that is a very rare event. Ongoing cost: O(hours) / month.

5 VO FE Admin requests
Adding a new VO FE can be expensive: apart from the config changes, we have to help them start running. However, new VOs are relatively rare. In steady state, VOs may request new sites or new attributes. G.Factory operators must also assist with debugging FE config changes, since error logs (currently) come only to the GF. Ongoing cost: O(hours) / week.

6 Following changes in the Grid
The G.Factory operational principle is trust-but-verify: G.Factory admins must approve any change in the G.Factory config. The Grid is a very dynamic place; at least one site makes a change every single day. Change discovery is mostly complaint driven, as we have no good tools to automate it. G.Factory admins thus must change the G.Factory config often. This is currently a mostly manual process, and better tools would be welcome. Ongoing cost: O(10 hours) / week.
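As a rough illustration of what automated change discovery could look like under the trust-but-verify principle, the sketch below compares the last admin-approved view of each site against a freshly observed one and lists the differences for an admin to review. Everything here is hypothetical: the snapshot layout, the attribute names (`gatekeeper`, `max_walltime_h`), the site names, and the `diff_snapshots` helper are illustrative assumptions, not part of glideinWMS or of any Grid information system.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: site name -> {attribute name -> value}.
Snapshot = Dict[str, Dict[str, str]]
Change = Tuple[str, str, str, str]  # (site, attribute, approved value, observed value)

ABSENT = "<absent>"


def diff_snapshots(approved: Snapshot, observed: Snapshot) -> List[Change]:
    """List every attribute whose observed value differs from the approved one.

    Nothing is applied automatically: the result is only a review queue,
    so an admin still approves each change by hand.
    """
    changes: List[Change] = []
    for site in sorted(set(approved) | set(observed)):
        old = approved.get(site, {})
        new = observed.get(site, {})
        for attr in sorted(set(old) | set(new)):
            if old.get(attr) != new.get(attr):
                changes.append(
                    (site, attr, old.get(attr, ABSENT), new.get(attr, ABSENT))
                )
    return changes


# Made-up example data (assumption, for illustration only).
approved = {
    "SiteA": {"gatekeeper": "ce1.example.org", "max_walltime_h": "48"},
    "SiteB": {"gatekeeper": "ce.siteb.example.org"},
}
observed = {
    "SiteA": {"gatekeeper": "ce2.example.org", "max_walltime_h": "48"},
    "SiteC": {"gatekeeper": "ce.sitec.example.org"},
}

pending = diff_snapshots(approved, observed)
for site, attr, old_value, new_value in pending:
    print(f"{site}: {attr} {old_value} -> {new_value} (needs admin approval)")
```

The point of the design is that the tool only discovers changes; deciding whether they enter the G.Factory config remains a manual step.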

7 Grid debugging 1/2
With O(50k) glideins running at any time, we always find something broken somewhere. We see the full spectrum of errors: broken worker nodes (validation errors), broken CEs (authentication/startup/monitor errors), and network problems (glideins not registering). Mostly we cannot directly solve the problem(s), i.e. we have to notify the remote admins. But we have to discover the root cause to get it solved.
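The error spectrum above suggests a simple first triage step: group observed failures by site and by the component they most likely point to, so that the right remote admins can be notified. The following is a minimal sketch under stated assumptions; the `(site, error kind)` record format, the error kind labels, and the `triage` helper are hypothetical and do not reflect the actual glidein log format.

```python
from collections import Counter
from typing import Iterable, Tuple

# Mapping from error kind to the likely broken component, following the
# categories named on the slide. The key strings are illustrative labels.
CATEGORY_BY_ERROR = {
    "validation": "worker node",
    "authentication": "CE",
    "startup": "CE",
    "monitor": "CE",
    "not_registered": "network",
}


def triage(records: Iterable[Tuple[str, str]]) -> Counter:
    """Count failures per (site, suspected component).

    Error kinds not in the mapping are counted as "unknown" so that
    they stay visible instead of being silently dropped.
    """
    counts: Counter = Counter()
    for site, error_kind in records:
        category = CATEGORY_BY_ERROR.get(error_kind, "unknown")
        counts[(site, category)] += 1
    return counts


# Made-up records (assumption, for illustration only).
records = [
    ("SiteA", "validation"),
    ("SiteA", "validation"),
    ("SiteB", "authentication"),
    ("SiteB", "monitor"),
    ("SiteC", "not_registered"),
    ("SiteC", "segfault"),
]

counts = triage(records)
for (site, category), n in sorted(counts.items()):
    print(f"{site}: {n} failure(s) pointing at {category}")
```

A summary like this does not find the root cause by itself; it only narrows down where to start looking before contacting a site.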

8 Grid debugging 2/2
The Grid is a difficult place to debug. Most sites are black boxes for us. Luckily, glideins provide lots of info in the logs, when we get them... a broken site may not return anything useful, or anything at all. Prodding the black box is often needed, which is hard! And some problems may be VO specific, too. Ongoing cost: many FTEs DC, if we had them.

9 What else do we do?
In order to make our life easier, we also host a test glideinWMS instance and develop new helper tools. The test glideinWMS instance allows us to discover problems early, thus both increasing user satisfaction and reducing the time needed for debugging errors. We create helper tools to suit our needs, and anything major we contribute back to glideinWMS.

10 The test glideinWMS Instance
The test glideinWMS instance contains both a G.Factory and a VO Frontend, which allows us to do end-to-end testing. The major focus is on the G.Factory, to test before deploying in production: new SW releases, new sites, and new services on existing sites.

11 Summary
Operating a G.Factory is much more than keeping the G.Factory service alive; indeed, this part takes an almost negligible amount of time. Most effort goes into debugging Grid-related problems: at O(50k) CPUs, something is always broken somewhere. Finally, providing expertise to help VO FE Admins is also an essential part of the job.

12 Acknowledgments
This document was sponsored by grants from the US NSF and US DOE, and by the UC system.

