CMS production use case
Number of Regional Centers: 11
Number of Computing Centers: 21
Number of CPUs: ~1000
Number of Production Passes for each Dataset (including analysis group processing done by production): 6-8
Number of Files: ~11,000
Data Size (not including fz files from Simulation): 17 TB
File Transfer over the WAN: 7 TB toward T1, 4 TB toward T2
Sites: Bristol/RAL, Caltech, CERN, FNAL, IC, IN2P3, INFN, Moscow, UCSD, UFL, WISC

Spring 2002: CMSIM, 6M events, 4.3 TB

Spring02: production summary
CMSIM: 1.2 seconds/event for 4 months
High luminosity Digitization: 1.4 seconds/event for 2 months
Number of events: 6M, 3.5M
[Plots: events requested vs. produced over time; labels recovered from the plots: April 19, June 7, February 8, May 31, 10^34]

Toward ONE Grid
Build a unique CMS-GRID framework (EU+US)
EU and US grids are not interoperable today; help is needed from the various Grid projects and middleware experts
Work in parallel in EU and US
Main US activities:
  PPDG, GriPhyN, iVDGL grid projects
  MOP
  Virtual Data Toolkit
  Interactive Analysis: Clarens
Main EU activities:
  EDG, EDT grid projects
  Integration of IMPALA with EDG middleware
  Batch Analysis: user job submission & analysis farm

EU CMS testbeds participation
DataGrid
  Testbed participants: CERN, INFN, CNRS, PPARC, NIKHEF, ESA
  Currently testbed1; testbed2 and testbed3 planned over the project period
DataTAG
  Participants: CERN, INFN, PPARC, UvA
  France?
CMS sites involved: CNAF, CCIN2P3, Bologna, Legnaro/Padova, Pisa, RAL, IC, Moscow, Ecole Polytechnique

PPDG MOP system
Monte Carlo Distributed Production system developed by PPDG
Submission of CMS production jobs from a central location to a remote location, with results returned
Relies on:
  GDMP for replication
  Globus GRAM, Condor-G and local queuing systems for job scheduling
  IMPALA for job specification
  DAGMan for management of dependencies between jobs (cmsim after cmkin, etc.)
Being deployed in the US-CMS grid testbed

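The dependency handling mentioned on this slide (cmsim after cmkin) can be illustrated with a short sketch. This is not DAGMan or MOP code; it only shows the ordering idea using Python's standard `graphlib` module (Python 3.9+). The job graph is hypothetical: only "cmsim after cmkin" comes from the slide, and the digitization step is an assumed extra stage.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each job maps to the set of jobs it depends on.
# Only "cmsim after cmkin" is stated on the slide; "digitization" is assumed.
dependencies = {
    "cmkin": set(),
    "cmsim": {"cmkin"},
    "digitization": {"cmsim"},
}

def submission_order(deps):
    """Return the job names in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())

order = submission_order(dependencies)
print(order)
```

A real dependency manager also tracks job completion and failures before releasing the next job; the sketch covers only the ordering.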
GriPhyN/PPDG VDT
US-CMS and GriPhyN/PPDG are developing the Virtual Data Toolkit
Main idea of Virtual Data:
  Materialization: hide from the user the real processing needed to satisfy a request
  It is more important to keep the information needed to recreate a DataSet than the DataSet itself
Introduce new catalogs: Virtual Data Catalog, Materialized Data Catalog
Based on MOP for job submission
Plan to use WP1 asap

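The virtual data idea on this slide (keep the recipe, recreate the dataset on demand) can be sketched as a toy model. This is not the Virtual Data Toolkit API; the class, its methods, and the `make_ntuple` recipe are hypothetical names chosen for illustration, and the two dictionaries only loosely mirror the two catalogs named on the slide.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class VirtualDataCatalog:
    """Toy model: recipes describe how to produce a dataset, the cache holds copies."""
    recipes: Dict[str, Callable[[], bytes]] = field(default_factory=dict)  # "virtual" entries
    materialized: Dict[str, bytes] = field(default_factory=dict)           # existing copies

    def register(self, name, recipe):
        self.recipes[name] = recipe

    def get(self, name):
        # Serve an existing copy if there is one; otherwise recreate it from the recipe.
        if name not in self.materialized:
            self.materialized[name] = self.recipes[name]()
        return self.materialized[name]

    def evict(self, name):
        # Dropping the data is safe as long as the recipe is kept.
        self.materialized.pop(name, None)

calls = []

def make_ntuple():
    # Stand-in for the real processing; records each time it actually runs.
    calls.append("run")
    return b"ntuple-bytes"

catalog = VirtualDataCatalog()
catalog.register("mu_MB2mu_pt4", make_ntuple)
first = catalog.get("mu_MB2mu_pt4")   # runs the recipe
second = catalog.get("mu_MB2mu_pt4")  # served from the materialized copy
catalog.evict("mu_MB2mu_pt4")
third = catalog.get("mu_MB2mu_pt4")   # recipe runs again after eviction
```

The user only ever calls `get`; whether processing happened behind the call is hidden, which is the point made on the slide.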
CMS EDG Prototype
EU working on integration of CMS production tools (IMPALA/BOSS) with EDG software
Sites involved: École Polytechnique, Bologna, IC, Padova, Moscow
  Synchronization with EDG 1.2 should make it possible to include Lyon
Modify the production tools IMPALA/BOSS to allow remote submission from any site that has a UI installed
  CMS software preinstalled at the CEs
  Interface with EDG submission (WP1)
  Interface with EDG data/file management (WP2)

Production processing
[Diagram: production workflow for the request "Produce events dataset mu_MB2mu_pt4". Components shown: Production Interface, Production RefDB, request summary file, Production manager, IMPALA decomposition (job scripts), jobs, IMPALA monitoring (job scripts), BOSS DB, RC farm, farm storage, Regional Center (RC)]
Production manager coordinates task distribution to Regional Centers
Data location through Production DB

Job submission: BOSS submission to EDG scheduler
BOSS on the User Interface machine (UI) submits to the EDG scheduler
BOSS v3+:
  BOSS will accept and pass on a JDL file
  jobGridExecuter, executable, std in, std out and std err are automatically placed in the JDL sandboxes
BOSS job monitoring over the grid
  Monitoring DB is open for write access to every machine!

Job submission: IMPALA on User Interface machine
IMPALA uses BOSS on the UI machine to submit jobs to the EDG scheduler
Modifications to IMPALA:
  IMPALA creates EDG JDL files
  Post-process stage: GetOutputJob_edg.sh script
  Minor changes: scripts on the UI machine do not set up the CMS environment, as only the WNs need the CMS software

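As an illustration of what "IMPALA creates EDG JDL files" could involve, here is a minimal sketch that writes a JDL-style job description. It is not IMPALA code. The attribute names (Executable, StdOutput, StdError, InputSandbox, OutputSandbox) follow the ClassAd-like JDL style as commonly documented for EDG and should be checked against the installed release; the file names are hypothetical.

```python
def jdl_value(value):
    """Render a string or a list of strings in ClassAd-like JDL syntax."""
    if isinstance(value, (list, tuple)):
        return "{" + ", ".join(f'"{v}"' for v in value) + "}"
    return f'"{value}"'

def make_jdl(executable, stdout, stderr, input_sandbox, output_sandbox):
    """Return the text of a small JDL file (attribute names are assumptions)."""
    attrs = [
        ("Executable", executable),
        ("StdOutput", stdout),
        ("StdError", stderr),
        ("InputSandbox", input_sandbox),
        ("OutputSandbox", output_sandbox),
    ]
    return "\n".join(f"{key} = {jdl_value(val)};" for key, val in attrs) + "\n"

# Hypothetical job: a wrapper script shipped in the input sandbox,
# with stdout and stderr returned in the output sandbox.
jdl_text = make_jdl(
    executable="cmsim_job.sh",
    stdout="job.out",
    stderr="job.err",
    input_sandbox=["cmsim_job.sh"],
    output_sandbox=["job.out", "job.err"],
)
print(jdl_text)
```

Generating the text from a list of attributes keeps the sandbox contents in one place, so a wrapper can add its own files before submission.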
Job submission: IMPALA on User Interface machine
GetOutputJob_edg.sh:
  Performs a dg-job-status for all submitted jobs
  Output from finished jobs is retrieved and moved into batch/logs
  IMPALA tracking is updated, i.e.
    cleared jobs -> finished status
    aborted jobs -> problem status

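The tracking update on this slide can be sketched as a simple status mapping. The real GetOutputJob_edg.sh is a shell script that calls dg-job-status; this sketch does not invoke any grid command and assumes the status strings have already been collected. The job identifiers and the "submitted" and "Running" values are hypothetical.

```python
# Mapping taken from the slide: cleared -> finished, aborted -> problem.
EDG_TO_IMPALA = {"cleared": "finished", "aborted": "problem"}

def update_tracking(tracking, edg_statuses):
    """Return a new tracking dict updated from the reported grid statuses.

    tracking:     job id -> current IMPALA tracking status
    edg_statuses: job id -> status string reported by the grid
    Jobs whose grid status has no mapping keep their current tracking status.
    """
    updated = dict(tracking)
    for job_id, status in edg_statuses.items():
        new_status = EDG_TO_IMPALA.get(status.strip().lower())
        if new_status is not None:
            updated[job_id] = new_status
    return updated

tracking = {"job1": "submitted", "job2": "submitted", "job3": "submitted"}
reported = {"job1": "Cleared", "job2": "Aborted", "job3": "Running"}
result = update_tracking(tracking, reported)
print(result)
```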
Job Submission: Current Status
The job submission chain IMPALA -> BOSS -> EDG job scheduler is working at several sites
Jobs are sent only to suitable CEs, selected by using RunTimeEnvironment in the JDL
Jobs are sent to the CE that is close to the SE holding a replica of the required input file

File management: flat files
EDG tools used
Resource Broker creates the .BrokerInfo file, which contains:
  Logical filenames (LFs)
  Physical filenames (PFNs)
  SEs close to the CE
edg-brokerinfo parses .BrokerInfo for useful information, e.g. converts logical filenames to transport filenames (TFNs)
GDMP copies files to an SE and registers them in a Replica Catalog
  Version 3.0.x has been installed at IN2P3, INFN and IC

File management: flat files
IMPALA modifications
GetReplica()
  Takes an input list of LFs, queries the .BrokerInfo file to get the corresponding TFNs (transport filenames) and copies the files to the current directory
  The edg-brokerinfo tool relies on a patched RB that produces a correct .BrokerInfo file (bug fixed in EDG release 1.2)

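The GetReplica() behaviour described on this slide can be sketched as follows. This is not the IMPALA implementation: the LF-to-TFN mapping is assumed to have been extracted from .BrokerInfo beforehand (for example with edg-brokerinfo), and the copy step is passed in as a callable so that a grid transfer tool can be substituted. The function name `get_replica`, its parameters, and the example names are hypothetical.

```python
import os
import shutil

def get_replica(logical_files, lf_to_tfn, copy=shutil.copy, dest_dir="."):
    """Copy the file behind each logical filename into dest_dir.

    lf_to_tfn: mapping LF -> transport filename (assumed to be known already)
    copy:      callable(source, target); a local copy stands in for the transfer
    Returns the list of target paths, in the order of logical_files.
    """
    fetched = []
    for lf in logical_files:
        tfn = lf_to_tfn.get(lf)
        if tfn is None:
            raise KeyError(f"no transport filename known for {lf}")
        # Keep only the last path component as the local file name.
        target = os.path.join(dest_dir, tfn.rsplit("/", 1)[-1])
        copy(tfn, target)
        fetched.append(target)
    return fetched

# Usage with a fake copy function that only records what would be transferred.
copied = []
paths = get_replica(
    ["lf_ntuple_001"],
    {"lf_ntuple_001": "gsiftp://se.example.org/data/ntuple_001.ntpl"},
    copy=lambda src, dst: copied.append((src, dst)),
    dest_dir="work",
)
print(paths)
```

Failing loudly on an unknown LF is a design choice of the sketch; the slide does not say how the real function handles missing entries.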
File management: flat files
IMPALA modifications
StagetoSE()
  Takes a file and copies it using globus-url-copy to the close SE named in the .BrokerInfo file
  Uses globus-job-run commands to perform remote operations on the storage element:
    Find the appropriate GDMP storage directory (by looking at the GDMP.conf file)
    Check for the output directory and create it if necessary
  Information is published to the replica catalog using GDMP and also written to File_location, which is returned to the user in the output sandbox

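The staging steps on this slide can be sketched as a function that builds, but does not run, the two Globus commands involved. This is a sketch under assumptions, not the IMPALA StagetoSE() code: the SE host and storage directory are assumed to be known already (from .BrokerInfo and the GDMP configuration), the host and path values are hypothetical, and the GDMP registration step is left out because its command-line options are not given on the slide.

```python
import os

def stage_to_se_commands(local_file, se_host, se_dir):
    """Build the commands for staging one file to a storage element.

    se_host: close SE, assumed to have been read from .BrokerInfo
    se_dir:  absolute storage directory on the SE, assumed to be known
    Returns a list of argument lists suitable for subprocess.run().
    """
    src = os.path.abspath(local_file)
    dest = f"gsiftp://{se_host}{se_dir}/{os.path.basename(src)}"
    return [
        # Create the output directory on the SE if it does not exist yet.
        ["globus-job-run", se_host, "/bin/mkdir", "-p", se_dir],
        # Copy the local file to the SE.
        ["globus-url-copy", f"file://{src}", dest],
    ]

cmds = stage_to_se_commands("/tmp/x.ntpl", "se.example.org", "/data/gdmp")
for cmd in cmds:
    print(" ".join(cmd))
```

Returning the commands instead of executing them makes the sketch testable on a machine without Globus installed; a real script would run each command and check its exit status.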
File management: flat files
Current status
Ntuple and fz files from the CMKIN and CMSIM stages have been successfully staged out to SEs, with information published in the Replica Catalog
The logical filename of the ntuples is put in the JDL of CMSIM jobs and is used within the job to retrieve a replica

EDG prototype
Summary
CMKIN and CMSIM stages of production have been run from several sites using EDG software for both job submission and replica management
Consolidate work into a single EDG-enabled version of IMPALA
EDG release 1.2
Increase scale of tests
Ensure grid resources have standard installations
Interface with Mass Storage System (MSS)
Demonstrate with a small production (O(1000) jobs)
This is crucial for CMS …

Conclusions/Remarks
CMS is not solely European
  Many grid projects in the US, hence the need to interoperate
  Currently the subject of DataTag/Glue
  DataGrid should, it seems to me, move strongly in this direction
  Certificates, VO, Dagman?
  Otherwise, there is a high risk of divergence
The current EDG software installation is intractable
  DataTag is working on a simplified distribution using Pacman
  CMS also plans to use Pacman to distribute the production tools
  Here too, there is a need to converge
  We are placing high hopes in Yannick's predefined lists, the User help desk, etc.

Conclusions/remarks (II)
Testbed(s)
There are many testbeds: DataGrid, DataTAG, CMS-US, ...
Evaluating the tools and integrating them into the current software requires installation and test cycles that are frequent and as flexible as possible
Current situation
  DataGrid as production testbed
    Validation of releases
    Stress tests
  CMS-specific testbed for development
    Software evaluation, with emphasis on functionality
    Rapid evolution of EDG+CMS sw prototypes