Software Release Build Process and Components in ATLAS Offline Emil Obreshkov for the ATLAS collaboration.


1 Software Release Build Process and Components in ATLAS Offline Emil Obreshkov for the ATLAS collaboration

2 Introduction
The software release building process in ATLAS Offline:
- ATLAS software components
- Code and configuration management
- Numbered and nightly releases
- Tag Collector and software package approvals
- Parallel and distributed builds
- NICOS (NIghtly COntrol System)
- Test and validation frameworks

3 ATLAS Experiment Characteristics
- Very complex detector: ATLAS has ~80 M electronic channels
- Very large collaboration: ATLAS involves ~3000 scientists and engineers from 174 institutions in 38 countries
- Large, geographically dispersed developer community: ~600 developers, mainly part time
- Large code base, mostly in C++ and Python: ~5M lines of code
- Several platforms and compilers in use
- Need to support stable production and development activities
  - Rapid bug fixing
  - High-statistics testing before deployment

4 ATLAS Software Components
Packages (~2000)
- Groups of C++ and/or Python classes
- The primary development/management units
- Have dependencies on classes in other packages
- Some packages are externally supplied
- ~200 packages change per week (across all nightlies)
Projects (~10)
- Groups of packages that can be built together, with similar dependencies within a project
- Domain specific (e.g. reconstruction, analysis)
- The primary release coordination units
- Some projects are externally supplied
Platforms (~6)
- Combinations of operating system, compiler and compiler flags
- E.g. Scientific Linux 4 & 5, gcc 3.4 and gcc 4.3, opt/dbg, 32-bit and 64-bit, icc & llvm
Branches
- Self-contained versions of the software supporting development and production
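
As a rough illustration of how these units relate, here is a minimal Python sketch (hypothetical field names, not ATLAS code) modelling packages, projects and platforms:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Package:
    name: str
    version: str                                          # tracked as an SVN tag of the package
    depends_on: List[str] = field(default_factory=list)   # names of other packages it uses

@dataclass
class Project:
    name: str                                             # e.g. a reconstruction or analysis project
    packages: List[Package] = field(default_factory=list)

@dataclass
class Platform:
    os: str                                               # e.g. "slc5"
    compiler: str                                         # e.g. "gcc43"
    build_type: str                                       # "opt" or "dbg"
    bits: int                                             # 32 or 64

reco = Project("Reconstruction",
               [Package("TrackingTools", "TrackingTools-01-02-03", ["EventModel"])])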

5 Code and Configuration Management
SVN (Subversion)
- Used as the source code repository (atlasoff)
- All projects share the same repository
- Can also incorporate software from other repositories
- SVN tags are used to track stable package versions
- SVN permissions are used to control read and write access
- Two additional separate SVN repositories hold user- and group-specific code (atlasusr & atlasgrp)
CMT (Configuration Management Tool)
- Used as the primary build and configuration management tool
- Manages dependencies between packages and projects
- Every package must specify its configuration in a text file called "requirements", which is easy to read and modify
- Establishes common patterns used by all packages and projects
  - E.g. compiler flags, linker options
  - Specific patterns or actions can be defined for use only in selected packages
- Manages the build process, ensuring the correct build sequence
- Sets up the correct runtime environment
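
To make the role of the "requirements" file concrete, here is a minimal sketch, not ATLAS code: the file content below is a hypothetical example in CMT-style syntax, and the small parser only extracts the "use" statements that declare dependencies on other packages.

import re

# Hypothetical content of a package "requirements" file (illustration only).
REQUIREMENTS = """\
package MyAnalysisPackage
author  A. Developer

use AtlasPolicy     AtlasPolicy-*
use GaudiInterface  GaudiInterface-*  External
use AthenaBaseComps AthenaBaseComps-* Control

library MyAnalysisPackage *.cxx components/*.cxx
apply_pattern component_library
"""

def used_packages(text):
    """Return (package, version pattern, container) for each 'use' statement."""
    uses = []
    for line in text.splitlines():
        m = re.match(r"\s*use\s+(\S+)\s+(\S+)(?:\s+(\S+))?", line)
        if m:
            uses.append(m.groups())
    return uses

print(used_packages(REQUIREMENTS))
# [('AtlasPolicy', 'AtlasPolicy-*', None), ('GaudiInterface', 'GaudiInterface-*', 'External'), ...]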

6 Numbered and nightly releases (1/2)
Nightly releases
- Daily builds kept for several days (usually a week) before being overwritten
- Coupled with several validation frameworks (see later slides)
- Coupled with a package approval mechanism by identified experts to minimize instabilities
Migration releases
- Nightly releases for developer communities, allowing "disruptive" migrations
- Typically standard nightly releases with a few specific package modifications
Numbered releases
- Stable snapshots of software used for production and analysis
- Deployed after high-statistics validation
Patch releases
- Provide overrides to fix small problems discovered after deployment of numbered releases
- Physics analysis caches (10 of them and growing): dedicated to physics groups to capture specific analysis software on top of a patch release
- Numbered releases and patch releases are distributed separately on the grid

7 Numbered and nightly releases (2/2)
[Diagram: base release 16.0.1, with patch releases 16.0.1.1 and 16.0.1.2 built on top of it, and physics analysis patch releases 16.0.1.1.1 through 16.0.1.1.5 built on top of 16.0.1.1]
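
A small sketch, assuming only the numbering convention shown in the diagram (three components for a base release, four for a patch release, five for a physics analysis patch release):

def release_kind(version):
    """Classify a release number by its depth, following the scheme in the diagram above."""
    depth = len(version.split("."))
    return {3: "base release",
            4: "patch release",
            5: "physics analysis patch release"}.get(depth, "unknown")

for v in ("16.0.1", "16.0.1.1", "16.0.1.1.3"):
    print(v, "->", release_kind(v))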

8 Tag Collector and Package Approvals (1/2)
- Web-based tool to assign packages to projects and package versions to releases, and to describe project dependencies
- Developer user interface for submitting new package versions
- Newly submitted package versions go through a validation and approval procedure before being accepted into a release branch
- A set of release coordinators handles approvals, supported by automatically generated e-mails
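
The approval procedure above could be modelled roughly as the following state flow; the states and transitions are a hypothetical simplification, not the Tag Collector's actual implementation.

from enum import Enum

class TagState(Enum):
    SUBMITTED = "submitted"     # developer has submitted a new package version
    VALIDATED = "validated"     # version passed validation in the nightlies
    APPROVED  = "approved"      # accepted by a release coordinator into the release branch
    REJECTED  = "rejected"

# Allowed transitions in this simplified model of the approval procedure.
TRANSITIONS = {
    TagState.SUBMITTED: {TagState.VALIDATED, TagState.REJECTED},
    TagState.VALIDATED: {TagState.APPROVED, TagState.REJECTED},
}

def advance(state, new_state):
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition: {state.value} -> {new_state.value}")
    return new_state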

9 Tag Collector and Package Approvals (2/2)

10 Parallel and Distributed Builds (1/2)
These techniques are required in order to build in a timely manner (< 1 night):
Platform-level parallelism
- Builds for each platform are performed in parallel and the results merged together
- One build machine per platform, driven by the ATLAS Nightly Control System (NICOS) on a farm of dedicated build machines
Project-level parallelism
- Projects having no mutual dependencies are built in parallel (NICOS)
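
As a rough illustration of the platform-level parallelism, the sketch below starts the same build on one machine per platform and waits for all of them to finish; the host names and the nicos_build command are placeholders, not the real NICOS interface.

import subprocess

# Hypothetical platform -> build machine mapping (placeholder host names).
PLATFORMS = {
    "i686-slc5-gcc43-opt":   "build-node-01",
    "x86_64-slc5-gcc43-opt": "build-node-02",
    "x86_64-slc5-gcc43-dbg": "build-node-03",
}

def launch_platform_builds(release):
    """Start one build per platform in parallel and return platform -> exit code."""
    procs = {platform: subprocess.Popen(["ssh", host, "nicos_build", release, platform])
             for platform, host in PLATFORMS.items()}
    return {platform: proc.wait() for platform, proc in procs.items()}

if __name__ == "__main__":
    print(launch_platform_builds("rel_nightly"))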

11 Parallel and Distributed Builds (2/2)
Package-level parallelism
- Packages with no cross dependencies are built in parallel, taking advantage of multi-core machines
- Tbroadcast implements this parallelism; it is defined on top of CMT
File-level parallelism
- Parallel make (make -j)
- Distributed compilation using distcc
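
The package-level idea can be sketched as follows: schedule each package as soon as everything it depends on has been built, and let independent packages build concurrently. This is a minimal illustration of the approach, not tbroadcast's actual code; the dependency graph and the build command are made up.

import subprocess
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# Hypothetical dependency graph: package -> set of packages it depends on.
DEPS = {
    "CoreUtils": set(),
    "EventModel": {"CoreUtils"},
    "TrackingTools": {"CoreUtils"},
    "Reconstruction": {"EventModel", "TrackingTools"},
}

def build(pkg):
    # Placeholder for the real per-package build (e.g. running CMT make in the package directory).
    subprocess.run(["echo", "building", pkg], check=True)
    return pkg

def parallel_build(deps, workers=4):
    done, running = set(), {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(done) < len(deps):
            # Launch every package whose dependencies are already built and which is not yet running.
            for pkg, d in deps.items():
                if pkg not in done and pkg not in running and d <= done:
                    running[pkg] = pool.submit(build, pkg)
            finished, _ = wait(list(running.values()), return_when=FIRST_COMPLETED)
            for fut in finished:
                pkg = fut.result()
                done.add(pkg)
                del running[pkg]

parallel_build(DEPS)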

12 NICOS (1/2)
[Diagram: the NICOS control system drives the ATLAS nightly builds; it takes input from the Tag Collector and ATLAS SVN, builds with CMT, performs code checkout, unit testing, QA testing and integration testing with the ATN test tool, runs error analysis, and publishes build results and automatic e-mails]

13 NICOS (2/2)
Nightly build stability is assured by local disk usage and by automatic discovery of and recovery from failures
- Nightly releases are checked out, built, and tested on the local disks of the build machines
- Failures to connect to an external tool, such as the Tag Collector, are followed by several repeat attempts
The quality of releases is tested immediately
- NICOS has an integrated "Atlas Testing Nightly" (ATN) testing system
  - It runs more than 300 integration tests in all domains of ATLAS software, from simple standalone tests to full reconstruction of a few events
- ATN test results are available for all platforms shortly after the release build completes
- A set of shifters checks the results every day and reports any problems
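
The retry-on-failure behaviour described above can be sketched with a generic helper like the one below (an illustration, not NICOS code; the number of attempts and the delay are arbitrary):

import time

def retry(action, attempts=5, delay=60):
    """Call action(); on failure, wait and try again, up to the given number of attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            if attempt == attempts:
                raise
            print(f"attempt {attempt} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)

# Usage sketch: wrap the transient steps, e.g. retry(lambda: query_tag_collector(release))
# or retry(lambda: svn_checkout(package)), where the wrapped functions are placeholders
# for the real calls.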

14 NICOS Web Pages (1/2)

15 NICOS Web Pages (2/2)

16 Experience
Use of the local disk on the build machines
- Avoids AFS problems and heavy AFS disk usage
- Tests are also performed locally
- One platform is used as the master for the copy to AFS; the other platforms merge in only their binaries
Use of tbroadcast + distcc makes the builds fast
- Time to build and copy: ~8h
- Previously it took more than 24h, which was unacceptable for our needs
NICOS has error detection mechanisms incorporated to ensure a proper build
- Checkout from SVN: wait and retry on failure
- Retry of build problems (due to AFS or network failures)
- Retries in case of failures to connect to the Tag Collector
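
A very rough sketch of the master-copy-plus-binary-merge step described above; the directory layout and paths are hypothetical, not the real ATLAS release layout.

import shutil
from pathlib import Path

# Hypothetical locations (illustration only).
LOCAL = Path("/build/local/rel_nightly")                        # release built on the local disk
AFS   = Path("/afs/cern.ch/atlas/software/builds/rel_nightly")  # shared release area

def publish(platform, is_master):
    if is_master:
        # The master platform copies the whole release tree (sources, scripts, its binaries).
        shutil.copytree(LOCAL, AFS, dirs_exist_ok=True)
    else:
        # Other platforms merge in only their platform-specific binary directory.
        shutil.copytree(LOCAL / platform, AFS / platform, dirs_exist_ok=True)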

17 Test and Validation Frameworks
Several test and validation frameworks exist for nightly and numbered releases.
Nightlies:
- ATN (Atlas Testing Nightly): unit tests and functionality tests at O(<10) events
- RTT (Run Time Tester): developer-defined functionality tests at O(100) events
- FCT (Full Chain Test): production tests at O(1K) events, run on a dedicated instance of RTT using simulated data
- TCT (Tier0 Chain Test): production tests at O(10K) events, run on a dedicated instance of RTT using real data
Numbered releases:
- BCT (Big Chain Test): production tests at O(1M) events, run on the GRID or the central CERN processing facility (Tier0) using real data
- SampleA: production tests at O(100K) events, run on the GRID using simulated data
FCT, TCT, BCT and SampleA check functionality and physics quantities, with semi-automatic comparison against reference histograms.
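
As an illustration of what a comparison against reference histograms can look like, here is a minimal bin-by-bin check, assuming simple Poisson (sqrt(N)) errors; it is a sketch of the general idea, not the ATLAS validation code.

import math

def compare_histograms(test, reference, threshold=3.0):
    """Return the bins where the test histogram deviates from the reference by more
    than `threshold` sigma, assuming Poisson errors on both histograms."""
    assert len(test) == len(reference)
    flagged = []
    for i, (t, r) in enumerate(zip(test, reference)):
        sigma = math.sqrt(t + r) or 1.0          # avoid dividing by zero for empty bins
        if abs(t - r) / sigma > threshold:
            flagged.append((i, t, r))
    return flagged

# An empty result means the distribution agrees with the reference within the chosen
# tolerance; anything else is reported for a human (e.g. a shifter) to inspect.
print(compare_histograms([100, 250, 90, 10], [105, 240, 95, 40]))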

18 Run Time Tester
- Framework using RTT packages
- Unified test configuration: an XML file defines an RTT job
- An RTT job is a three-step process
- Manual and automated modes of running
- Developers create and upload/set RTT jobs; RTT finds, runs and publishes the results
- TCT and FCT run inside RTT
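
To illustrate the idea of a job being defined by an XML file, here is a sketch that reads a made-up job definition; the element names are invented for illustration and do not follow the real RTT schema.

import xml.etree.ElementTree as ET

# Made-up job definition (illustration only, not the real RTT configuration format).
JOB_XML = """\
<rttJob>
  <package>MyAnalysisPackage</package>
  <jobOptions>MyAnalysisJobOptions.py</jobOptions>
  <events>100</events>
  <tests>
    <test>CheckLogForErrors</test>
    <test>CompareHistogramsToReference</test>
  </tests>
</rttJob>
"""

root = ET.fromstring(JOB_XML)
job = {
    "package": root.findtext("package"),
    "joboptions": root.findtext("jobOptions"),
    "events": int(root.findtext("events")),
    "tests": [t.text for t in root.findall("tests/test")],
}
print(job)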

19 RTT Job Results

20 Build and Test Clusters
Build cluster
- Used for full, patch and migration nightlies
- Numbered releases are created by copying and renaming nightly releases
- 4-core machines: 8 x SLC5
- 8-core machines: 42 x SLC5
- 8 distcc server machines, shared with other experiments
Test cluster (RTT, FCT, TCT)
- 112 machines in total: 13 launch nodes and 99 test executors

21 Final Conclusions
- A robust release building and validation infrastructure is essential for large, complex HEP experiments with a distributed developer base
- Performing the build and test process every night eases software development and increases its quality
- A combination of nightly and numbered releases supports both development and production activities
- Rigorous validation is important, both before and after deployment
  - Rapid patching of problems discovered after deployment
- Dedicated hardware and manpower resources are required
  - A build cluster and multiple complementary validation testbeds

22 Questions? Backup slides follow.

23 NICOS
NICOS (NIghtly COntrol System) manages the multi-platform nightly releases of ATLAS software. It is flexible and easy to use.
- At initialization, the information about project tags and project dependencies is retrieved from the Tag Collector
- Code is checked out from the ATLAS SVN repository
- Projects are built with the CMT configuration management tool
- Quality checks, unit tests and integration tests are performed
- Build and test results are posted on the NICOS web pages; automatic e-mail notifications about problems are sent to developers
Builds of different projects and platforms are performed in parallel, with all processes thoroughly synchronized:
- Builds of independent projects run in parallel
- Once a project is built, its testing starts in parallel with the build of the next project in the chain
- Builds on different platforms are performed simultaneously on different build machines; upon completion the results are merged to a single location on the AFS file system
- This parallelism fully loads the multi-processor build machines and accelerates the completion of the nightly releases
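
The build/test pipelining described above (testing one project while the next one builds) can be sketched as follows; the project chain and the build/test functions are placeholders, not NICOS code.

from concurrent.futures import ThreadPoolExecutor

# Hypothetical ordered project chain and placeholder build/test steps.
CHAIN = ["Core", "Event", "Reconstruction", "Analysis"]

def build(project):
    print(f"building {project}")     # placeholder for the real project build

def test(project):
    print(f"testing {project}")      # placeholder for the real project tests

def pipelined_nightly(chain):
    """Build each project in order; start its tests while the next project builds."""
    with ThreadPoolExecutor() as pool:
        pending = []
        for project in chain:
            build(project)                              # builds stay on the critical path
            pending.append(pool.submit(test, project))  # tests overlap with the next build
        for fut in pending:
            fut.result()                                # wait for all test jobs at the end

pipelined_nightly(CHAIN)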

24 Tbroadcast
- Tbroadcast implements parallelism across packages in a project
- It is a Python script defined on top of CMT
- It parses the output of the "cmt show uses" command to get the dependency graph and other package information
- This gives better compilation times
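
A rough sketch of the underlying idea: turn a dependency listing into a graph that a scheduler (like the package-level sketch earlier) can use. The listing below is a simplified, made-up approximation of "cmt show uses" output, and the parser is an illustration, not tbroadcast's actual code.

import re

# Simplified, made-up approximation of "cmt show uses" output for one package;
# indentation after the '#' indicates the level of the dependency.
USES_OUTPUT = """\
# use EventModel EventModel-*
#  use CoreUtils CoreUtils-*
# use TrackingTools TrackingTools-*
#  use CoreUtils CoreUtils-*
"""

def parse_uses(text, package="Reconstruction"):
    """Return {package: set of direct dependencies} from the indented listing."""
    graph = {package: set()}
    stack = [(0, package)]                       # (indent level, package name)
    for line in text.splitlines():
        m = re.match(r"#(\s*)use\s+(\S+)", line)
        if not m:
            continue
        indent, pkg = len(m.group(1)), m.group(2)
        while len(stack) > 1 and stack[-1][0] >= indent:
            stack.pop()
        graph.setdefault(stack[-1][1], set()).add(pkg)
        graph.setdefault(pkg, set())
        stack.append((indent, pkg))
    return graph

print(parse_uses(USES_OUTPUT))
# {'Reconstruction': {'EventModel', 'TrackingTools'}, 'EventModel': {'CoreUtils'}, ...}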

25 Distcc
- Distcc is a program to distribute builds of C, C++, Objective-C and Objective-C++ code across several machines on a network
- It generates the same results as a local build (if set up correctly)
- It does not require all machines to share a filesystem
- Distcc only runs compiler and assembler jobs
  - The compiler and assembler take a single input file and produce a single output file; distcc ships these files across the network
- The preprocessor runs locally, since it needs access to the header files on the local machine
- The linker runs locally, since it needs to examine libraries and object files
- Building is easy: distcc works with "make -j"; the -j value is normally set to about twice the number of available CPUs
- ATLAS uses a dedicated distcc cluster, lxdistcc, kindly provided and supported by CERN/IT
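
A minimal sketch of driving a distcc-backed build from Python, following the "about twice the number of CPUs" rule of thumb above; the host list is a placeholder, not the real lxdistcc configuration.

import os
import subprocess

# Placeholder distcc server list; a real setup points DISTCC_HOSTS at the distcc cluster.
env = dict(os.environ,
           DISTCC_HOSTS="localhost build-node-01 build-node-02",
           CC="distcc gcc",
           CXX="distcc g++")

jobs = 2 * (os.cpu_count() or 1)                 # "-j about twice the available CPUs"
subprocess.run(["make", f"-j{jobs}"], env=env, check=True)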

