OEP infrastructure issues Gregory Dubois-Felsmann Trigger & Online Workshop Caltech 2 December 2004.


1 OEP infrastructure issues Gregory Dubois-Felsmann Trigger & Online Workshop Caltech 2 December 2004

2 Obligatory caveat
I’m available for advice and to provide continuity, but…
I won’t be able to undertake any more non-trivial, non-emergency OEP development.

3 What OEP is
A conceptual unit of the online system
– The framework for processing all complete-event data in the online system
An implementation; a set of code that:
– Defines and navigates the raw event structure (both the data from ODF and the persistence of data from Level 3)
– Makes this data available to applications in the standard BaBar Framework
  – Level 3
  – Fast Monitoring
  – Other monitoring applications: event displays, beam spot monitoring, etc.
– Provides distributed histogramming services
– Controls the lifetimes of all the processes that do this work

4 Current status
(Up-to-date performance metrics are not available because we, unexpectedly, are not running.)
The conceptual design and the event data format have turned out to work well, and I don’t see a need to revise them.
The performance of the system as implemented has been very satisfactory for several years.
– On the old Solaris farm, we were CPU-constrained, but that time was dominated by the performance of the Level 3 algorithms themselves
– On the Linux farm we have had lots of headroom even at 1 L3/node…
– Until quite recently: Rainer reports that since we started running Fast Monitoring on Linux (i.e., faster) and running the second monitoring farm instance for beam spot measurement, the trickle stream service has become CPU-intensive

5 There have been some upgrades
Several iterations of improvement in process lifetime control tools (OepDaemon/OepManager; many thanks to Jim H.)…
… which enabled running more Fast Monitoring processes and additional sets of them
Rewrite from scratch of low-level DHP infrastructure, much fine-tuning
Improvements in logging performance (see Jim’s talk)

6 There is more that can be done
Framework overhead, and interface-to-Framework overhead
– This was found typically to be about 25% in the old Solaris days
– Can address several things:
  – Framework overhead: Level 3 runs a large number of modules, so this can add up
    – There may be some effort invested in this motivated by speeding up the physics executables, which have enormous numbers of modules
  – Interface-to-Framework overhead: there’s some unnecessary copying of data that could be eliminated by trickier coding; probably trivial benefit
  – Event navigation overhead: probably a 10% speedup in Level 3 from the long-planned “fast module scanning” project
    – This is a fairly straightforward non-multi-threaded programming problem and doesn’t need anything other than a good C++ programmer
One related project, for the record
– Making input modules work for non-event data
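To put the 25% overhead figure in perspective, here is a rough back-of-the-envelope sketch. It assumes (this is not stated on the slide) that the 25% is a fraction of total per-event CPU time and that the rest of the processing is unaffected. The function name and the example inputs are hypothetical.

```python
def speedup_from_overhead_cut(overhead_fraction, fraction_removed):
    """Overall speedup if `fraction_removed` of the overhead is eliminated.

    ASSUMPTION: `overhead_fraction` is the share of total per-event CPU time
    spent in overhead, and everything else stays the same.
    """
    if not 0.0 <= overhead_fraction < 1.0:
        raise ValueError("overhead_fraction must be in [0, 1)")
    if not 0.0 <= fraction_removed <= 1.0:
        raise ValueError("fraction_removed must be in [0, 1]")
    # Time remaining after the cut, in units of the original per-event time.
    remaining = 1.0 - overhead_fraction * fraction_removed
    return 1.0 / remaining


# Hypothetical inputs: a 25% overhead, half of which is removed.
print(round(speedup_from_overhead_cut(0.25, 0.5), 3))
```

Under these assumptions, even removing the overhead entirely gives a speedup of at most 1/(1 - 0.25), about 1.33.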

7 Still more that can be done
CPU utilization
– We have two CPUs on each farm node
– The load from (ODF event level + OEP framework + Level 3 code) is concentrated in a single thread that runs the Level 3 algorithms
– Could run two parallel streams of Level 3 processing
– Requires a (much) more sophisticated version of the interface-to-Framework OEP code
– This was in the original design but was sacrificed to 1999-era schedule triage; the need hasn’t been acute enough since then (it only became relevant after the Linux upgrade)
– This is a straightforward design but needs to be implemented by someone with a good understanding of multiprocessing
– There are some technical questions about DHP and logging, basically:
  – Are the multiple L3 instances to be treated as independent sources, or will they be re-aggregated per node?
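The shape of the "two parallel streams" idea can be sketched as below. This is an illustration of the structure only, not the OEP design: all names are hypothetical, the events are stand-in integers, and Python threads are used for brevity even though they would not give real CPU parallelism; a real implementation would use separate processes or native threads in C++.

```python
import queue
import threading


def run_parallel_streams(events, process_event, n_streams=2):
    """Fan events out to `n_streams` workers; return one result list per stream."""
    work = queue.Queue()
    results = [[] for _ in range(n_streams)]

    def worker(stream_id):
        while True:
            event = work.get()
            if event is None:  # sentinel: no more events for this worker
                break
            results[stream_id].append(process_event(event))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_streams)]
    for t in threads:
        t.start()
    for event in events:
        work.put(event)
    for _ in threads:
        work.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results


def reaggregate(per_stream_results):
    """Merge the per-stream outputs into a single per-node list."""
    return sorted(x for stream in per_stream_results for x in stream)


per_stream = run_parallel_streams(range(10), lambda e: e * 2)
print(len(per_stream), reaggregate(per_stream))
```

Returning `per_stream` directly corresponds to treating the instances as independent sources; calling `reaggregate` corresponds to merging them per node, which is the open question raised for DHP and logging.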

8 Yet more that can be done
Trickle stream
– The Fast Monitoring architecture depends on transferring events over the network from the Level 3 processes, on a sampling basis, to other machines running the monitoring code
– Apparently the server side of the existing system is expensive
– The long-pending “advanced trickle stream” is being commissioned now. It shares no code with the old protocol, so we’ll have to re-measure this
– It doesn’t seem likely to be an intrinsic problem: we receive a higher volume of data on the network from the event builder, very inexpensively
– The more sophisticated event distribution system mentioned above would be able to take this load out of the Level 3 process
– But one could consider a model in which (some) Fast Monitoring code runs on the same machines that run Level 3
  – There are concerns about further eroding the “deadtime firewall”
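As a toy illustration of forwarding events "on a sampling basis", the sketch below passes on every N-th event. The slide does not say how the trickle stream actually selects events, so the fixed-prescale rule and the name `trickle_sample` are assumptions made purely for illustration.

```python
def trickle_sample(events, prescale):
    """Yield every `prescale`-th event, starting with the first.

    ASSUMPTION: a fixed prescale; the real selection rule is not given
    in the talk.
    """
    if prescale < 1:
        raise ValueError("prescale must be >= 1")
    for index, event in enumerate(events):
        if index % prescale == 0:
            yield event


# Stand-in integer "events"; only the sampled ones would go to monitoring.
print(list(trickle_sample(range(10), 3)))
```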

9 Scaling
We run on 30 nodes now.
We know we can run on 60 (from experience in the Sun era).
We don’t quite understand the implications of running two (or more) instances of Level 3 per node for scaling of DHP and logging.
So the scaling of a (more nodes) x (more processes/node) system is not fully understood.
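The (more nodes) x (more processes/node) concern comes down to how many sources DHP and logging would see. A minimal sketch of that count, using the node counts quoted above and two hypothetical models for how per-node instances are presented (the function name and the boolean switch are illustrative, not part of OEP):

```python
def source_count(nodes, l3_per_node, reaggregate_per_node):
    """Number of sources seen by DHP/logging under two hypothetical models."""
    if nodes < 0 or l3_per_node < 1:
        raise ValueError("need nodes >= 0 and l3_per_node >= 1")
    if reaggregate_per_node:
        return nodes  # each node merges its instances into one source
    return nodes * l3_per_node  # every Level 3 instance is its own source


print(source_count(30, 1, False))  # today: 30 nodes, 1 L3 per node
print(source_count(60, 2, False))  # 60 nodes, 2 L3 per node, independent sources
print(source_count(60, 2, True))   # same farm, re-aggregated per node
```

With independent sources, 60 nodes at two instances each means 120 sources, four times the 30 in use today; re-aggregating per node keeps the count at 60, which experience has already shown to work.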

10 Conclusions
We will probably need to use one or more of these tools in order to get to 2007.
The development work will require someone with a solid understanding of C++ and multiprocessing.

