
1 Grid job management: CMS and Alice
Artem Trunov, CMS and Alice support

2 CMS and Alice site roles
Tier0
– Initial reconstruction
– Archive RAW + REC from first reconstruction
– Analysis, detector studies, etc.
Tier1
– Archive a fraction of RAW (2nd copy)
– Subsequent reconstruction
– "Skimming" (off AOD)
– Archiving sim data produced at T2s
– Serve AOD data to other T1s and T2s
– Analysis
Tier2
– Simulation production
– Analysis

3 CMS Tools
Bookkeeping tools
– Description of data: datasets, provenance, production information, data location
– DBS/DLS (Data Bookkeeping Service / Data Location Service)
Production tools
– Manage scheduled or personal production of MC, reconstruction, skimming
– ProdAgent
Transfer tools
– All scheduled inter-site transfers
– PhEDEx
User tools
– For end users to submit analysis jobs
– CRAB
[Diagram: DBS/DLS, ProdAgent, PhEDEx (web interface) and CRAB, with backends including Oracle, LFC, UI, FTS and srmcp]

4 Bookkeeping Tools
Centrally set up database
Has different scopes ("local", "global") to differentiate official data from personal/group data
Uses an Oracle server at CERN
The data location service uses either Oracle or LFC
Very complex system (the lookup flow is sketched below)
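A minimal sketch of the two-step lookup this slide describes: the bookkeeping service answers "which file blocks make up this dataset", the location service answers "which sites host this block". The endpoints and field names here are hypothetical placeholders, not the real DBS/DLS API.

```python
import json
import urllib.request

DBS_URL = "https://cmsdbs.example.org/query"   # hypothetical bookkeeping endpoint
DLS_URL = "https://cmsdls.example.org/blocks"  # hypothetical location endpoint

def find_dataset_blocks(dataset, scope="global"):
    """Ask the bookkeeping service which file blocks make up a dataset."""
    reply = urllib.request.urlopen(f"{DBS_URL}?dataset={dataset}&scope={scope}")
    return json.load(reply)["blocks"]

def locate_block(block):
    """Ask the location service which storage elements host a given block."""
    reply = urllib.request.urlopen(f"{DLS_URL}?block={block}")
    return json.load(reply)["sites"]

if __name__ == "__main__":
    for block in find_dataset_blocks("/SomePrimaryDataset/Sim/AOD"):
        print(block, "->", locate_block(block))
```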

5 Production Agent
Manages all production jobs at all sites
A set of daemons running on top of an LCG UI, run by operators at their home institutions
– usually on a dedicated machine, due to the low submission rate and large CPU consumption per submission
– operators' certificates are used to submit jobs
– operators have the VOMS role cmsprod
Instances of it are used by production operators
– 1 for OSG + a few for EGEE sites
Merges production job output to create reasonably sized files (see the sketch below)
Registers new data in PhEDEx for transfer to the final destination
– MC data produced at T2s go to a T1
All production info goes to DBS
Users make MC production requests via the web interface and can also track their requests there
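The merge step mentioned above can be pictured as a simple accumulator: collect the small output files each production job reports, and emit a merge job once enough data has piled up for one dataset. The function names and the size threshold are illustrative assumptions, not the ProdAgent API.

```python
from collections import defaultdict

MERGE_THRESHOLD_BYTES = 2 * 1024**3   # target merged-file size (assumed ~2 GB)

pending = defaultdict(list)           # dataset -> [(lfn, size_bytes), ...]

def submit_merge_job(dataset, lfns):
    """Stand-in for a grid submission done with the operator's cmsprod proxy."""
    print(f"merge {len(lfns)} files of {dataset} into one reasonably sized file")

def register_output(dataset, lfn, size_bytes):
    """Called whenever a production job finishes and reports an output file."""
    pending[dataset].append((lfn, size_bytes))
    if sum(size for _, size in pending[dataset]) >= MERGE_THRESHOLD_BYTES:
        submit_merge_job(dataset, [name for name, _ in pending[dataset]])
        pending[dataset].clear()
```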

6 PhEDEx
CMS data transfer tool
The only tool that is required to run at all sites
– work in progress to ease this requirement: with SRM, fully remote operation is possible, but local support is still needed to debug problems
A set of site-customizable agents that perform various transfer-related tasks:
– download files to the site
– produce SURLs of local files for other sites to download
– follow migration of files to the MSS
– stage files from the MSS
– remove local files
Uses a 'pull' model of transfers, i.e. transfers are initiated at the destination site by the PhEDEx instance running there (see the sketch below)
Uses Oracle at CERN to keep its state information
Can use FTS to perform transfers, or srmcp
– or another means, like direct gridftp, but CMS requires SRM at sites
One of the oldest and most stable software components of CMS
– secret of success: development is carried out by CERN + site people who are/were involved in daily operations
Uses someone's personal proxy certificate
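The 'pull' model above can be sketched as a destination-side poll loop: ask the central (Oracle) state for files routed to this site, then fetch each one with FTS or srmcp. All names below are illustrative, not the PhEDEx agent API; only the srmcp command line is a real tool invocation.

```python
import subprocess

def fetch_pending_transfers(site):
    """Placeholder for a query against the central transfer-state database.
    Returns (source_surl, dest_surl) pairs currently routed to this site."""
    return []

def download(source_surl, dest_surl, use_fts=False):
    """Copy one file to the local storage element."""
    if use_fts:
        # an FTS job submission would go here instead of a point-to-point copy
        raise NotImplementedError("FTS submission not sketched")
    subprocess.run(["srmcp", source_surl, dest_surl], check=True)

def run_once(site):
    for src, dst in fetch_pending_transfers(site):
        download(src, dst)

if __name__ == "__main__":
    run_once("T1_EXAMPLE")   # a real agent would loop and sleep between polls
```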

7 CRAB
Main CMS user tool to submit analysis jobs
The user specifies a dataset to be analyzed, their application and the job configuration
CRAB then:
– locates the data with DLS
– splits the work into jobs (see the sketch below)
– submits the jobs to the Grid
– tracks the jobs
– collects the results
Can be configured to upload job output to some Storage Element with srm or gridftp
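A hedged sketch of the "splits jobs" step: divide the dataset's files into jobs of roughly equal event counts. This mimics the idea only, not CRAB's actual splitting code.

```python
def split_jobs(files_with_events, events_per_job):
    """files_with_events: list of (lfn, n_events) pairs.
    Returns a list of jobs, each a list of LFNs totalling about events_per_job events."""
    jobs, current, count = [], [], 0
    for lfn, n_events in files_with_events:
        current.append(lfn)
        count += n_events
        if count >= events_per_job:
            jobs.append(current)
            current, count = [], 0
    if current:                 # leftover files form the last, smaller job
        jobs.append(current)
    return jobs

# Example: split_jobs([("f1.root", 5000), ("f2.root", 7000)], 10000)
# -> [["f1.root", "f2.root"]]
```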

8 "Last mile" tools
Working with ntuples, custom skims etc.
Not well developed! People are so far on their own.
CRAB does not scale well down to end-user analysis: too rigid, requires maintaining a bookkeeping infrastructure service, etc.
– not convenient for physicists
User questionnaire on CRAB (D-CMS group)
– 6 use it
  - 3 like it
  - 2 report problems
– 3 don't use it
  - 1 never used the Grid
  - 1 avoids the Grid and CRAB because of problems
  - 1 needs to run on unofficial data
PROOF is of interest to CMS. There are groups setting up PROOF at their institutions
– access to CMS data structures via a loadable library
– working on a plugin for LFN lookup in DBS/DLS and PFN conversion

9 Other CMS tools
Software installation service
– in CMS, software is installed in the VO area via grid jobs by software managers who have the VOMS role cmssgm
– cmssgm also runs the SW monitoring service
– SW installation uses apt, rpm and scram
– base releases are required to be installed at sites to run jobs
– the user brings only their own compiled code on top of the base release via the grid sandbox (see the sketch below)
Dashboard
– collects monitoring information (mostly about jobs) and presents it through a web interface
– there is an effort to make use of Dashboard for all CMS monitoring
– it is now being adopted by some other experiments
– Dashboard relies on various grid components to get the state of a job, and is thus usually inaccurate by ~15%
ASAP
– automatic submission of analysis jobs using a web interface
– seems not to be used anymore; used to give the user an option to use his own software release in the form of a DAR (distribution archive)
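Because jobs only bring their own compiled code, a submission tool needs to check that the required base release is already installed in the site's VO software area. The directory layout below is an assumption for illustration; only the VO_CMS_SW_DIR environment variable name follows the usual LCG convention.

```python
import os

def base_release_installed(release, vo_sw_dir=None):
    """Return True if the requested base release directory exists in the VO area.
    The 'cmssw/<release>' subpath is assumed, not a documented layout."""
    vo_sw_dir = vo_sw_dir or os.environ.get("VO_CMS_SW_DIR", "/opt/cms")
    return os.path.isdir(os.path.join(vo_sw_dir, "cmssw", release))

if not base_release_installed("CMSSW_1_6_0"):
    print("base release missing at this site; the job would fail, pick another site")
```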

10 Alice services - AliEn
"Alice Environment" - an independent implementation of Grid middleware
Central services:
– Job Queue
– File catalog
– Authorization
Site services include:
– Computing Element (with LCG/gLite support)
– Storage Element (xrootd protocol)
– PackMan (SW installation)
– File Transfer Daemon (with FTS support)
– ClusterMonitor
– MonALISA agent
Services run at each site on a standard VO Box, optimized for low maintenance and unattended operation
– central and site services are managed by a set of experts including developers and regional responsible experts

11 Alice Services at a site
[Diagram: the AliEn central services talk to the site VO Box running the AliEn site services; the site services submit and follow job agents through the LCG CE, report configuration and monitoring, install software into $VO_ALICE_SW_DIR, and use the local xrootd cluster; the operator logs in via gsissh]

12 Job management
Job Agent model (see the sketch below)
– the AliEn CE (a site service) submits a job agent ('pilot job') to the site queue via the local batch interface or the LCG interface
– once such a job agent starts on a WN, it contacts the central database (the Job Queue, a central service) and picks up real tasks in the form of AliEn JDL
Software needed by jobs (as specified in the AliEn JDL) is installed via the PackMan site service
User input files are downloaded from remote xrootd servers to the local scratch area
Disadvantages of the agent model in AliEn
– more agents are submitted than there are real jobs, and many job agents exit without a real task after a few minutes
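A sketch of the pilot-job loop described above: the agent starts on a worker node, repeatedly asks the central Job Queue for a matching task, has the requested packages installed, runs the task, and exits if nothing arrives within a few minutes. All service calls here are placeholders, not the AliEn API.

```python
import time

def ask_job_queue(site, resources):
    """Placeholder for the call to the central Job Queue: returns a task
    description (a JDL-like dict) or None when nothing matches."""
    return None   # nothing ever matches in this demo, exercising the timeout path

def run_agent(site, resources, idle_limit_s=300, poll_s=30):
    """Pilot loop: keep asking for work, give up after idle_limit_s seconds."""
    started = time.time()
    while time.time() - started < idle_limit_s:
        task = ask_job_queue(site, resources)
        if task is None:
            time.sleep(poll_s)
            continue
        for package in task.get("Packages", []):
            print("asking PackMan to install", package)
        print("downloading input files, then running", task["Executable"])
        return
    print("no real task received; the agent exits (the disadvantage noted above)")

run_agent("SITE_EXAMPLE", {"cpu_cores": 1}, idle_limit_s=5, poll_s=1)
```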

13 User tools
Full-featured user login shell – the alien shell – is in fact the only official way to submit jobs in ALICE
– the same AliEn tools are used by production and by users
Data storage
– a global logical AliEn namespace for production and home directories, accessible in the shell
– the backend is xrootd

14 PROOF
Main end-user analysis tool in ALICE
PROOF is part of ROOT and gives a mechanism to parallelize the analysis of ROOT trees and ntuples
– requires high-speed access to data, which is not often possible in current setups
WN and server bandwidth limitation (see the arithmetic sketched below)
– WNs: 1 Gb/s per rack -> ~1.5 MB/s per CPU core (4 cores * 24 WNs/rack)
– servers: usually have 1 Gb/s, but the disk subsystem may perform better
Deployment model endorsed by the PROOF team: large dedicated clusters with locally pre-loaded data
– not preferable at the moment at the large, busy centres
Integration with AliEn via its own TFileAdapter
– the user gives an LFN, which is automatically converted to a PFN by the adapter
So far only at CERN (CAF). At CC, work in progress (see next slide).
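A back-of-the-envelope check of the per-core figure quoted above, assuming a 1 Gb/s rack uplink shared by 24 worker nodes with 4 cores each (the numbers on the slide).

```python
rack_uplink_bits_per_s = 1e9          # 1 Gb/s uplink per rack
cores_per_rack = 4 * 24               # 4 cores per WN, 24 WNs per rack
mb_per_s_per_core = rack_uplink_bits_per_s / 8 / cores_per_rack / 1e6
print(f"{mb_per_s_per_core:.1f} MB/s per core")   # ~1.3 MB/s, i.e. the ~1.5 MB/s on the slide
```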

15 PROOF at CC-IN2P3
[Diagram: local and remote user sessions connect through GSI authentication to the PROOF master on the VO Box, with workers running on the xrootd analysis pool backed by HPSS]
PROOF agents are run on the xrootd cluster and take advantage of the following:
– free CPU cycles, thanks to the low overhead of xrootd
– zero-cost solution: no new hardware involved
– direct access to data on disk, not using bandwidth; 1 Gb/s node interconnect when inter-server access is required
– transparent access to the full data set stored at our T1 for all experiments, via the xrootd-dcache link deployed on this xrootd cluster and dynamic staging
– management of the infrastructure fits conveniently into existing xrootd practices
– this setup is closer to a possible 2008 PROOF solution because of the 1 Gb/s node connection and large "scratch" space
– this is the kind of setup that T2 sites may also consider deploying

16 See also Thursday afternoon session on analysis centers.

