
1 CMS Online Data Quality Monitoring: Real-Time Event Processing Infrastructure
Srećko Morović, Institute Ruđer Bošković
On behalf of the CMS Collaboration
CHEP 2010, October 19, 2010, Academia Sinica, Taipei, Taiwan

2 Introduction
CMS Online DQM system
- facilitates efficient detector operation by providing live data quality and integrity information during data taking (live updates at Δt ~ minute)
- provides CMS-wide real-time access (experiment, CERN, remote) through a web server (DQM GUI)
- archival of runs, display of past runs
DQM system is fully integrated horizontally (all detector subsystems) and vertically (online, prompt and re-reco, simulation, software validation DQM), based on common infrastructure
Online results and initial data quality assessment are available to offline processing and analysis (Offline DQM)
- DQM GUI: display of Online/Offline data quality information → poster by Valdas Rapsevicius
- Run Registry: online/offline run book-keeping and summary tool

3 CMS Data Acquisition System
[Diagram label: L1 selection]
- Facilitates CMS data taking
- Detector data provider for the DQM (events, some histograms)
- High Level Trigger (HLT): system for online building, reconstruction, analysis and filtering of events (passed by the hardware L1 trigger)
- Storage Manager (SM) application: stores and distributes HLT-accepted events
  - multiple instances (~16) split the work on the event stream from the HLT
  - supports registration for event/histogram streams (DQM is a client)
→ Remi Mommsen: The Data Acquisition System of the CMS experiment at LHC

4 Online DQM Server: Overview
Event processing and histogramming system of the DQM
- produces all Online DQM data quality and integrity information
- Input: DAQ event stream (+ histograms)
- Output: delivery of histograms and quality tests to the GUI
Processing is done on a DQM cluster:
- over 20 different subdetector processing jobs (DQM Applications) running in parallel on several machines (up to 100 Hz event rate per job)
- producing ~300k histograms, of which ~50k are displayed in the GUI
- Shifters: access a smaller number of relevant summary histograms
- Quality tests for automated problem detection and notification
- For experts: huge sets of subdetector diagrams for diagnostics
Streamlined, daily testing and deployment of subdetector code updates through the separate Integration System (Online replica)
[Diagram labels: DAQ, DQM Server, Online DQM]
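The slide does not say how the quality tests are defined. As a rough illustration only, a test can be thought of as a function that maps a histogram to a status the GUI can flag. The sketch below assumes a histogram is a plain list of bin contents; the function name `dead_channel_test` and both thresholds are hypothetical and are not taken from DQM Core.

```python
from statistics import mean


def dead_channel_test(bins, min_fraction=0.1, max_dead=0.05):
    """Flag a histogram when too many bins sit far below the mean occupancy.

    bins: list of non-negative bin contents (hypothetical histogram model).
    min_fraction: a bin counts as "dead" if below this fraction of the mean.
    max_dead: tolerated fraction of dead bins before reporting an error.
    """
    if not bins or sum(bins) == 0:
        return "NO_DATA"  # nothing filled yet, so no verdict is possible
    threshold = min_fraction * mean(bins)
    dead = sum(1 for content in bins if content < threshold)
    return "ERROR" if dead / len(bins) > max_dead else "OK"
```

A status string of this kind is what an automated notification could key on, while the full histogram stays available to experts for diagnosis.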

5 DQM Server architecture
[Diagram labels: SM; registration, data requests; events, HLT histograms; DQM function manager (Run Control)]
Built out of individual (C++) event-serving and processing Applications tied together by a DB configuration
DQM Function Manager
- Component of the "Run Control and Monitoring System" (RCMS): CMS detector control system (hierarchy of subdetector FSMs)
- responsible for instantiation and state control of the DQM Server
- For online data taking: itself controlled by DAQ
- Server configuration: RCMS (XML) configuration stored in the global RCMS DB
Storage Manager Proxy Server (SMPS)
- handles registration management to the DAQ SM cluster
  - decouples the task from the DQM Applications
  - registers for all available data (all SMs)
  - sensitivity to low-rate events
  - sets the data rate and event stream selection (HLT trigger based)
- Intermediate buffering of events and histograms (and their summing)
- one instance runs on each server machine, local to the subsystem application, avoiding extra network data transfer
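The proxy role described here (one registration point, trigger-based selection, intermediate buffering, histogram summing) can be sketched in a few lines. This is a toy model under stated assumptions, not the SMPS implementation: the class `ProxySketch`, its methods, the event dictionaries and the trigger path names used with it are all hypothetical.

```python
from collections import deque


class ProxySketch:
    """Toy stand-in for a proxy between event sources and local consumers."""

    def __init__(self, buffer_size=100):
        self.buffer_size = buffer_size
        self.consumers = {}  # name -> (accepted trigger paths, bounded queue)
        self.hist_sums = {}  # histogram name -> summed bin contents

    def register(self, name, trigger_paths):
        # Consumers register with the proxy, not with every upstream source.
        queue = deque(maxlen=self.buffer_size)
        self.consumers[name] = (set(trigger_paths), queue)

    def on_event(self, event):
        # event: dict whose "paths" entry lists the trigger paths it fired.
        for paths, queue in self.consumers.values():
            if paths & set(event["paths"]):
                queue.append(event)  # oldest event is dropped when full

    def on_histogram(self, name, bins):
        # Sum the same histogram arriving from several upstream sources.
        current = self.hist_sums.get(name)
        if current is None:
            self.hist_sums[name] = list(bins)
        else:
            self.hist_sums[name] = [a + b for a, b in zip(current, bins)]

    def next_event(self, name):
        # Hand the oldest buffered event to a consumer, or None if empty.
        queue = self.consumers[name][1]
        return queue.popleft() if queue else None
```

The bounded queue is one possible way to cap the data rate per consumer; the slide does not state which policy the real server uses.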

6 DQM Server architecture: Event Processor
[Diagram labels: FU Event Processor (FU), DQM Application, events/histograms, histograms, DQM Collector, DQM GUI, DQM function manager]
DQM Applications
- subsystem-specific event processing and histogram production software
- written in the standard CMS software framework (CMSSW): C++, Python code
- Detector Online/Offline code reuse (required to access and process event information!)
- "InputSource" module: connection to SMPS
- DQM Core: implementation of histogramming, quality test facilities (and much more)
- DQM Network module: facility for histogram transfer to Collector/GUI
Filter Unit Event Processor
- Online equivalent of the CMS execution environment
- HLT component, reused to run DQM Applications
DQM Collector / GUI
- Collector receives processing results and delivers them to the GUI
- Long-lived components, independent of RCMS and the detector run cycle

7 Server robustness features
[Diagram labels: FU Event Processor, FU subprocess, DQM Application]
Requirement: system flexibility and robustness in a fast-changing environment and under changing detector conditions → need for frequent updates of DQM Application code and framework
DQM system is designed to be fault-tolerant: it should not go to an error state due to a single Application failure
FU Event Processor
- Master-Slave model: the Application runs as a forked child process of the FU
- Application crash handling
  - the Event Processor master instance always remains alive in case of a sub-process crash
  - able to restart the Application automatically if configured to do so
- robustly finishes its own Stop transition when the Application does not stop gracefully (stopping is often the least tested part of the code)
DQM function manager
- tolerant to individual FU EP instance failure (stays in Running state)
- parallel startup of all FUs, each able to start data processing independently
- reliable server stopping: waits for the FUs to Stop up to a timeout, then finishes the transition
[Diagram callouts: robustness to sub-component failure; run end transition reliability]
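The master/child pattern and the timeout-guarded Stop described on this slide can be illustrated as follows. This is a minimal sketch, assuming Python's `subprocess` module in place of a literal `fork()` inside the FU Event Processor; the function names `run_supervised` and `stop_with_timeout` and their parameters are hypothetical.

```python
import subprocess
import sys


def run_supervised(child_code, max_restarts=2):
    """Run child_code in a child process; restart it after a non-zero exit.

    Returns the list of exit codes observed. The parent process survives
    every child crash, mirroring a master that stays alive.
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        result = subprocess.run([sys.executable, "-c", child_code])
        exit_codes.append(result.returncode)
        if result.returncode == 0:
            break  # clean exit, nothing to restart
    return exit_codes


def stop_with_timeout(proc, timeout=2.0):
    """Ask a child to stop; kill it if it does not exit within the timeout."""
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
        return "stopped"
    except subprocess.TimeoutExpired:
        proc.kill()  # the parent still completes its own Stop transition
        proc.wait()
        return "killed"
```

The point of the second function is that the parent's Stop always terminates in bounded time, whether or not the child cooperates.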

8 Parallel event processing model
Existing system:
- data processing within a DQM Application is a serial chain of processing modules, done event-by-event
- an Application can easily become CPU limited when doing intensive calculations (e.g. track reconstruction) → limits the event rate per subsystem
Motivation: a higher rate gives better histogram statistics and better sensitivity to rare events; this is especially needed as the amount of interesting data grows with tighter triggers (higher LHC luminosity)
Strategy: increase the event rate by splitting the event stream across multiple CPU cores for processing
- parallelization model carried over from the HLT
Approaches: event funneling and histogram summing
- FU EP: spawns multiple child process copies; special care is taken to split the event stream from SMPS to avoid data duplication
- DAQ ResourceBroker: data from the FUs received through a shared memory buffer
- Storage Manager (local): receives all data, combines it into a single stream
- data forwarded to the "collector" FU → exporting histograms to the GUI
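The slide does not say how the stream is split without duplication. One simple scheme that has this property, shown here purely as an assumption, is to assign each event to exactly one worker by its event number modulo the number of workers. The function names are hypothetical.

```python
def assign_worker(event_number, n_workers):
    """Map an event to exactly one worker (hypothetical modulo scheme)."""
    return event_number % n_workers


def split_stream(event_numbers, n_workers):
    """Partition an event stream into disjoint per-worker lists."""
    shares = [[] for _ in range(n_workers)]
    for number in event_numbers:
        shares[assign_worker(number, n_workers)].append(number)
    return shares
```

Because the assignment is a function of the event alone, every event lands in exactly one share, so no worker processes data that another worker also sees.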

9 Parallel event processing (continued)
EVENT FUNNELING
"parallel" FUs: event processing → stream combining → histograms filled in the "collector" FU
Advantages:
- Output result correctness (excluding exact time ordering of events)
Drawbacks:
- Only the event processing chain runs in parallel, not histogram calculation, quality tests...
- suitable only when histogram filling is quick and event processing is slow
- high performance cost of de/serialization of the complex ROOT objects needed for inter-process data transfer
HISTOGRAM SUMMING
histograms from the split event stream are summed by the Storage Manager
Advantages:
- the whole DQM Application runs in parallel
- no event de/serialization
Drawbacks:
- possibly incorrect results due to summing correctness issues:
  - no general method to combine histograms; implemented only for averages and cumulative data
  - not applicable to non-summable (non-statistical) information, e.g. detector status diagrams
Still an experimental feature
- Needs per-Application performance assessment and "tuning"
- multicore hardware is already hosting the Online DQM Server
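The summing correctness issue can be made concrete with a small example. The sketch below assumes histograms are plain lists of bin contents and averages are carried as (mean, number of entries) pairs; both function names are hypothetical and do not reflect how the Storage Manager implements summing.

```python
def sum_counts(parts):
    """Bin-by-bin sum; correct for plain counting histograms."""
    return [sum(bins) for bins in zip(*parts)]


def combine_means(parts):
    """Combine per-worker (mean, n_entries) pairs into one overall mean.

    Averaging the means directly would be wrong when workers saw
    different numbers of entries, so each mean is weighted by its entries.
    """
    total = sum(n for _, n in parts)
    if total == 0:
        return 0.0, 0
    return sum(m * n for m, n in parts) / total, total
```

Counting histograms add bin by bin, and averages can be recovered if the entry counts travel with them. A status value has no such rule: if two workers each report "1" for an enabled detector part, a bin-by-bin sum yields 2, which no longer means anything. This is the kind of non-summable information the slide excludes.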

10 Summary
- Online (and offline) DQM was in place and working robustly on day 1 of the LHC and has proven a cornerstone of CMS data taking efficiency and data certification (selection of runs for analysis).
- Up-to-date detector information (and a history of runs) is available to the experiment through real-time event processing
- Infrastructure: application building blocks (FU EP, SMPS, SM...) put together by RCMS configuration
- Built-in robustness (fault-tolerant design) → less problem-solving for the shifter and experts (valuable in night hours)
- Fast and flexible update policy allows up-to-date code use in production → important in a fast-changing environment (early data taking period...)
- Limitations of the existing system (processing power) are being looked at; extensions are being developed (still experimental)
→ Aaron Soha: Web Based Monitoring in the CMS Experiment at CERN

11 BACKUP

12 DQM GUI
Web-based, experiment-wide data quality information display
- dynamic, asynchronous UI updates
- Variety of information grouped and presented
  - front page: detector overview
  - content: mostly detector data integrity plots, some physics plots
  - "quick" plot collections for the shifter
  - Provenance
  - Application: subdet. on/off
- Detailed subdetector collections available for experts to look for issues

