


1 Large Scale and Performance Tests of the ATLAS Online Software
CERN ATLAS TDAQ Online Software System
D. Burckhart-Chromek, I. Alexandrov, A. Amorim, E. Badescu, M. Caprini, M. Dobson, R. Hart, R. Jones, A. Kazarov, S. Kolos, V. Kotov, D. Liko, L. Lucio, L. Mapelli, M. Mineev, L. Moneta, M. Nassiakou, L. Pedro, A. Ribeiro, Y. Ryabov, D. Schweiger, I. Soloviev, H. Wolters
CHEP2001, Beijing, China

2 Content
CHEP2001 - Large Scale and Performance Tests of the ATLAS Online System - D. Burckhart-Chromek
- The Online System in ATLAS TDAQ
- Testing in the Online System
- Aims of the Large Scale and Performance Tests
- Approach
- Test Series and their Setup
- Test Configurations
- Results
- Experience and tips for doing large scale tests
- Future tests and Conclusions

3 TDAQ System/Context
[Diagram. Labels shown: Detector (~200 nodes), Detector Control System, LVL1 Trigger, Dataflow (~800 nodes), Readout System, Data Collection, High Level Trigger, Physics & Event Selection Architecture (PESA), Reconstruction Framework (Athena), HLT Strategy Algorithms, Event Store, Offline Computing. Data labels: LVL1 Input, LVL1 Result, Detector Data, Selected Events.]
Online Software: Configuration, Run Control, Process Control, Inter Process Communication, Message Reporting, Info Service, Monitoring
Elements running the online software: Detector (~200 nodes), LVL1 Trigger, Dataflow (~800 nodes), Readout System, Data Collection, High Level Trigger

4 Aims of the Large Scale and Performance Tests
- Verify Scalability of the online system to a large configuration
- Study Interaction between the online components in a large configuration
- Measure Performance: take timing values of the various setup, run control transition and shutdown phases
- Understand System Limits: push the system to a very large size
- Perform selected Fault Tolerance tests

5 Testing in the Online System
- Component Testing
  - Formal Inspection of Components
  - Unit Tests of Components
  - Nightly Builds with component check
- System Integration Testing
  - Nightly Builds with basic check on integration
  - Last Successful Nightly Build available to developers
- Planned Public Releases
  - 3-5 times a year
  - Remote Test Centers test the Pre-Release, retrieving the system from a tar file or from CD-ROM
- Deployment in Test Beam Operation gives feedback

6 Approach for the Large Scale and Performance Tests
- Test Preparation
  - Test Plan prepared beforehand defines aims, scope, configurations, resources and describes the tests
- Testware
  - use of existing example programs for controllers and monitoring, use of standard setup script
  - utility scripts to establish the configuration, and to start/stop process manager daemons
  - functionality of other systems emulated where necessary
- During the Tests
  - automatically produced test results and log files
  - immediate logging and follow up of issues found
  - fixes and enhancements verified in the next iteration

7 DAQ Configuration
- The ATLAS Detector: each sub-detector has a large number of readout nodes/crates
- The Online System: the Control Tree connects the sub-detectors
- The online system is responsible for:
  - Configuration Database
  - Run Control
  - Process Management
  - Inter Process Communication
  - Message Reporting
  - Information Service
  - Monitoring
- Control of a multi-detector system
- The configuration database describes a partition:
  - information on all processes and their relationships
  - the run control hierarchy in the online system
  - startup and shutdown dependencies
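One way to picture the "startup and shutdown dependencies" stored in the configuration database is as a dependency graph that is sorted before processes are launched. The sketch below is only an illustration in Python, not the ATLAS implementation: the process names and the dependency table are hypothetical, and the standard-library `graphlib.TopologicalSorter` stands in for whatever ordering logic the online software actually uses.

```python
from graphlib import TopologicalSorter

# Hypothetical process names. Each key maps to the processes that must be
# running before it can start; none of this is taken from the real database.
STARTUP_DEPENDS_ON = {
    "ipc_server": [],
    "message_reporting": ["ipc_server"],
    "information_service": ["ipc_server"],
    "root_controller": ["message_reporting", "information_service"],
    "detector_controller": ["root_controller"],
    "crate_controller": ["detector_controller"],
}


def startup_order(depends_on):
    """Return an order in which every process starts after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Assumption: shutdown simply reverses the startup order."""
    return list(reversed(startup_order(depends_on)))


if __name__ == "__main__":
    print("startup: ", startup_order(STARTUP_DEPENDS_ON))
    print("shutdown:", shutdown_order(STARTUP_DEPENDS_ON))
```

A cycle in the table would make `static_order()` raise `graphlib.CycleError`, which is one cheap way to catch an inconsistent dependency description before a large run.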

8 Test Set-Up
- Hardware and Network
  - 6 test series on 3 test clusters, 2 days - 1 week: 16, 65, 112 PCs, Linux 6.1, 400-733 MHz, 128-512 MB
  - afs, nfs, local network
- Base Partition
  - 10 independent partitions created
  - 11 PCs per partition
  - one process manager daemon
  - per crate/node: one run controller, one monitoring sampler
  - read out crates are linked to a detector controller
- Test:
  - run each base partition separately
  - run base partitions in parallel

9 Test Configuration - 3 Level Partitions
- Separate partitions are combined
- 10 configurations built from the base partitions
  - up to 10 base partitions + 1 root controller + 1 monitoring factory
  - one monitoring sampler per crate controller
  - up to 112 PCs in a 3-level hierarchy
- Test: run the 10 test partitions sequentially
- Example for 112 nodes [diagram]: Root Controller, Monitoring factory, Detector controller, 10 crate controllers
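The way the 3-level configurations grow from the base partitions can be checked with a small model. The following Python sketch is an illustration only: the `Controller` class and function names are hypothetical, and it assumes that the root controller and the monitoring factory each occupy one PC of their own, which is consistent with 10 x 11 + 2 = 112 but is not stated explicitly on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """A node in the run control tree (hypothetical model)."""
    name: str
    children: list = field(default_factory=list)


def build_partition(n_base, crates_per_detector=10):
    """Root controller -> one detector controller per base partition -> crate controllers."""
    root = Controller("root")
    for d in range(n_base):
        detector = Controller(f"detector-{d}")
        detector.children = [
            Controller(f"crate-{d}-{c}") for c in range(crates_per_detector)
        ]
        root.children.append(detector)
    return root


def count_controllers(node):
    """Count this controller plus everything below it."""
    return 1 + sum(count_controllers(child) for child in node.children)


def depth(node):
    """Number of levels in the hierarchy, counting the root as level 1."""
    return 1 + max((depth(child) for child in node.children), default=0)


def pc_count(n_base, pcs_per_base=11):
    """Assumption: root controller and monitoring factory take one extra PC each."""
    return n_base * pcs_per_base + 2


if __name__ == "__main__":
    tree = build_partition(10)
    print(count_controllers(tree), "controllers in", depth(tree), "levels on", pc_count(10), "PCs")
```

With 10 base partitions the model gives 111 controllers in 3 levels on 112 PCs, matching the figures quoted for the largest configuration.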

10 Nested Partitions in Configuration data file
See the contribution to this conference: Atlas DAQ Configuration Databases by Igor Soloviev

11 100-controller partition

12 Timing tests: Logical View of Transitions

13 Setup/Boot/Shutdown/Close (IT-Cluster)
[Plot. Annotations:]
- Slow increase with larger configuration
- Constant
- Expected increase with number of processes
- Dependency problem discovered

14 Scalability and Performance: RC state transitions (IT-Cluster)
[Plot. Annotation: Heavy load of communication. Curves:]
- Single state transition
- Single state transition plus 1s
- 3 state transitions
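Timing values such as those plotted for the setup, boot, shutdown, close and run control transition phases can be collected with a very small harness. The Python sketch below is a generic illustration under stated assumptions, not the testware used in these tests: `time_phases` and `placeholder` are hypothetical names, and the phases only sleep instead of driving a real system.

```python
import time


def time_phases(phases):
    """Run each (name, action) pair in order and return elapsed seconds per phase."""
    elapsed = {}
    for name, action in phases:
        start = time.perf_counter()
        action()
        elapsed[name] = time.perf_counter() - start
    return elapsed


def placeholder(seconds):
    """Stand-in for a real phase: the returned action only sleeps."""
    return lambda: time.sleep(seconds)


if __name__ == "__main__":
    # Phase names follow the slides; the durations are invented for the demo.
    phases = [
        ("setup", placeholder(0.02)),
        ("boot", placeholder(0.03)),
        ("transition", placeholder(0.01)),
        ("shutdown", placeholder(0.02)),
        ("close", placeholder(0.01)),
    ]
    for name, seconds in time_phases(phases).items():
        print(f"{name:10s} {seconds:.3f} s")
```

Repeating the same measurement for each configuration size gives one point per phase and size, which is the kind of data the scalability plots are built from.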

15 Results in numbers
- For the large test partitions:
  - on 112 PCs, ~340 processes were running: 111 controllers, 100 monitoring samplers, 112 pmg daemons, ~10 servers, 1 monitoring factory
  - ~850 entries in the database data file (250 sw, 600 hw)
- First large scale test:
  - 45 issues found (bugs, problems, improvement suggestions)
  - 52 working days of effort (8 h each) over an elapsed time of 3 weeks of test preparation and 1 week of testing, excluding analysis, with 1-3 testers
  - tons of log files
- Following iterations:
  - re-use original test plan and add brief update
  - preparation time reduced radically to ~2-3 days
  - test runs mostly done automatically
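The quoted totals can be cross-checked with simple arithmetic. The Python sketch below only re-adds the numbers given on the slide; the dictionary names are hypothetical, and the server count is entered as 10 although the slide gives it as approximate (~10).

```python
# Process counts as quoted for the 112-PC configuration.
# "servers" is approximate on the slide (~10), so the total is approximate too.
PROCESS_COUNTS = {
    "controllers": 111,
    "monitoring samplers": 100,
    "pmg daemons": 112,
    "servers": 10,
    "monitoring factory": 1,
}

# Entries in the database data file, split as quoted (software / hardware).
DB_ENTRIES = {"sw": 250, "hw": 600}


def controllers_in_tree(detectors=10, crates_per_detector=10):
    """1 root + one controller per detector + one per crate."""
    return 1 + detectors + detectors * crates_per_detector


total_processes = sum(PROCESS_COUNTS.values())   # 334, i.e. roughly 340
total_db_entries = sum(DB_ENTRIES.values())      # 850

if __name__ == "__main__":
    print("controllers:", controllers_in_tree())
    print("processes:  ", total_processes)
    print("db entries: ", total_db_entries)
```

The sum of 334 is consistent with the rounded figure of ~340 processes, and the 111 controllers follow from 1 root, 10 detector and 100 crate controllers.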

16 Experience and Tips
- Preparation:
  - Require Unit Tests of components
  - Prepare a detailed Test Plan beforehand
  - Run large Scale Tests on a tested and frozen release
  - Foresee expandable, flexible configuration and test infrastructure
  - Encourage precise information logging for problem tracing
- Organization:
  - Store the testware in the software repository
  - Run the testware regularly/automatically to verify it is up to date
  - Re-use test items like test structure, testware, scripts, checklists
- Network:
  - Use NFS not AFS
  - Run on isolated network & monitor activity

17 Conclusions and Future
- The online system can run a partition consisting of > 100 PCs
- The online system can run partitions in parallel
- Scalability tests spot problems that cannot be seen any other way
- Shielding from the CERN network has a very positive effect
- A 4-level hierarchy behaves very similarly to a 3-level one
- Very large scale Stress Tests help in studying process communication
- Future:
  - Run basic integration test at each successful nightly build
  - Repeat Tests on a regular basis for each major release, building on existing material
  - Push scale further to uncover new effects
  - Automate the tests further
  - Gradually include more SW items and components from other systems
Many thanks to CMS and to CERN/IT for giving us access to their PC clusters

