ATLAS Production
Kaushik De, University of Texas at Arlington
LHC Computing Workshop, Ankara, May 2, 2008

Slide 2: Outline
- Computing Grids
- Tiers of ATLAS
- PanDA Production System
- MC Production Statistics
- Distributed Data Management
- Operations Shifts
- Thanks to T. Maeno (BNL), R. Rocha (CERN), J. Shank (BU) for some of the slides presented here

Slide 3: EGEE Grid + NorduGrid – NDGF (Nordic countries)

Slide 4: OSG – US Grid

Slide 5: Tiers of ATLAS
- 10 Tier 1 centers: Canada, France, Germany, Italy, Netherlands, Nordic Countries, Spain, Taipei, UK, USA
- ~35 Tier 2 centers: Australia, Austria, Canada, China, Czech Republic, France, Germany, Israel, Italy, Japan, Poland, Portugal, Romania, Russian Federation, Slovenia, Spain, Switzerland, Taipei, UK, USA
- ? Tier 3 centers

Slide 6: Tiered Example – US Cloud (diagram; labels include Wisconsin and the Tier 3's)

Slide 7: Data Flow in ATLAS

Slide 8: Storage Estimate

Slide 9: Production System Overview (diagram)
- Tasks requested by a Physics Working Group are approved and entered into ProdDB
- Panda submits the jobs to the clouds: CA, DE, ES, FR, IT, NL, TW, UK, US (NDGF coming)
- Clouds send jobs; output files are registered in DQ2

Slide 10: Panda
- PanDA = Production ANd Distributed Analysis system
- Designed for analysis as well as production
- Project started Aug 2005, prototype Sep 2005, production Dec 2005
- Works with both OSG and EGEE middleware
- A single task queue and pilots
  - Apache-based central server
  - Pilots retrieve jobs from the server as soon as CPU is available → low latency
- Highly automated, has an integrated monitoring system, and requires little operations manpower
- Integrated with the ATLAS Distributed Data Management (DDM) system
- Not exclusively ATLAS: has its first OSG user, CHARMM

Slide 11: Cloud (diagram: Panda, Tier 1 storage and Tier 2s; job input files are dispatched to the Tier 2s and job output files flow back to the Tier 1 storage)

Slide 12: Panda/Bamboo System Overview (architecture diagram)
- Components: Panda server, Bamboo, ProdDB, Autopilot (condor-g), pilots on Worker Nodes at sites A and B, job logger, LRC/LFC, DQ2
- End-users submit jobs to the Panda server over https; pilots pull jobs; logs are sent to the logger over http

Slide 13: Panda server (Apache + gridsite, backed by PandaDB; talks to clients, pilots, the logger, LRC/LFC and DQ2 over https)
- Central queue for all kinds of jobs
- Assigns jobs to sites (brokerage)
- Sets up input/output datasets
  - Creates them when jobs are submitted
  - Adds files to output datasets when jobs are finished
- Dispatches jobs
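The "central queue plus brokerage" idea can be illustrated with a short sketch. Everything below (classes, site attributes, selection criteria) is hypothetical rather than the actual Panda server code; it only shows the shape of the logic: assign a queued job to a site that has the input data and free CPUs, then hand the highest-priority activated job to a pilot asking for work.

```python
# Hedged sketch of the central-queue/brokerage idea; data structures and
# selection criteria are illustrative, not the real PandaDB schema.
from dataclasses import dataclass, field

@dataclass
class Site:
    name: str
    free_cpus: int
    datasets: set = field(default_factory=set)   # datasets already resident

@dataclass
class Job:
    pandaid: int
    input_dataset: str
    priority: int
    status: str = "defined"   # defined -> assigned -> activated -> running
    site: str = None

def broker(job, sites):
    """Pick a site for a queued job, preferring sites that already hold the input."""
    candidates = [s for s in sites if s.free_cpus > 0]
    with_data = [s for s in candidates if job.input_dataset in s.datasets]
    chosen = max(with_data or candidates, key=lambda s: s.free_cpus, default=None)
    if chosen:
        job.site, job.status = chosen.name, "assigned"
    return chosen

def get_job(queue, site_name):
    """What a pilot request returns: the highest-priority activated job for its site."""
    ready = [j for j in queue if j.site == site_name and j.status == "activated"]
    if not ready:
        return None
    job = max(ready, key=lambda j: j.priority)
    job.status = "running"
    return job
```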

Slide 14: Bamboo (Apache + gridsite; talks to prodDB via cx_Oracle and to the Panda server over https)
- Gets jobs from prodDB and submits them to Panda
- Updates job status in prodDB
- Assigns tasks to clouds dynamically
- Kills TOBEABORTED jobs
- A cron triggers the above procedures every 10 minutes
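A minimal sketch of the Bamboo cycle described above, under stated assumptions: all helper functions are placeholders (the real service queries prodDB through cx_Oracle and contacts the Panda server over https), and only the four steps and the 10-minute cadence come from the slide.

```python
# Hedged sketch of one Bamboo cycle; in production this is driven by a cron
# entry every 10 minutes rather than the sleep loop shown here.
import time

def get_jobs_from_proddb():          # placeholder for a cx_Oracle query
    return []

def submit_to_panda(jobs):           # placeholder for an https call to Panda
    pass

def update_status_in_proddb(jobs):   # placeholder: push Panda statuses back
    pass

def assign_tasks_to_clouds():        # placeholder: dynamic task-to-cloud assignment
    pass

def kill_aborted_jobs():             # placeholder: jobs flagged TOBEABORTED
    pass

def bamboo_cycle():
    jobs = get_jobs_from_proddb()
    submit_to_panda(jobs)
    update_status_in_proddb(jobs)
    assign_tasks_to_clouds()
    kill_aborted_jobs()

if __name__ == "__main__":
    while True:                      # stand-in for the 10-minute cron trigger
        bamboo_cycle()
        time.sleep(600)
```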

Slide 15: Client-Server Communication
- HTTP/S-based communication (curl + grid proxy + python)
- GSI authentication via mod_gridsite
- Most communications are asynchronous
  - The Panda server spawns python threads as soon as it receives HTTP requests and sends responses back immediately; the threads do heavy procedures (e.g., DB access) in the background → better throughput
- Several are synchronous
- Wire format (diagram): the client sends an x-www-form-urlencoded HTTPS request to the server's UserIF (mod_python); Python objects are serialized/deserialized with cPickle, and responses are compressed with mod_deflate
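As a rough illustration of that request/response cycle, here is a minimal client sketch. The endpoint URL and method names are hypothetical, and GSI proxy handling is omitted (the real client goes through curl with a grid proxy); the urlencode/pickle/deflate handling mirrors the diagram labels only.

```python
# Hedged sketch of a Panda client call: urlencoded HTTPS request in,
# (possibly deflated) pickled Python object back.
import pickle
import zlib
import urllib.parse
import urllib.request

PANDA_URL = "https://pandaserver.example.org:25443/server/panda"  # hypothetical

def call_panda(method, **params):
    """POST urlencoded parameters and unpickle the server's reply."""
    data = urllib.parse.urlencode(params).encode()
    req = urllib.request.Request(f"{PANDA_URL}/{method}", data=data)
    with urllib.request.urlopen(req) as resp:
        payload = resp.read()
        if resp.headers.get("Content-Encoding") == "deflate":
            payload = zlib.decompress(payload)       # mod_deflate on the server side
    return pickle.loads(payload)                      # server pickles its response

# e.g. job = call_panda("getJob", siteName="EXAMPLE_SITE", prodSourceLabel="managed")
```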

Slide 16: Data Transfer
- Relies on ATLAS DDM
  - Panda sends requests to DDM; DDM moves files and sends notifications back to Panda
  - Panda and DDM work asynchronously
- Dispatches input files to T2s and aggregates output files to the T1
- Jobs get 'activated' when all input files are copied, and pilots pick them up
  - Pilots don't have to wait for data arrival on the WNs
  - Data transfer and job execution can run in parallel
- Sequence (diagram): submitter submits a job → Panda subscribes a T2 to the dispatch dataset via DQ2 → the DQ2 callback activates the job → a pilot gets and runs the job → on finish, output files are added to the destination datasets and transferred
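The "activated on callback" behaviour can be sketched as below. The class and callback entry point are simplifications invented for illustration; the real flow goes through DQ2 subscriptions and the Panda DB, but the rule is the one on the slide: a job becomes visible to pilots only once every input file has arrived.

```python
# Hedged sketch: a DDM transfer callback flips a job to 'activated' only when
# all of its input files have been reported as transferred.
class Job:
    def __init__(self, pandaid, input_files):
        self.pandaid = pandaid
        self.waiting_inputs = set(input_files)   # files not yet at the site
        self.status = "assigned"

    def on_ddm_callback(self, transferred_file):
        """Called when DDM notifies Panda that one input file has arrived."""
        self.waiting_inputs.discard(transferred_file)
        if not self.waiting_inputs and self.status == "assigned":
            self.status = "activated"            # now a pilot may pick it up

# illustrative file names
job = Job(1234, ["EVNT._0001.pool.root", "EVNT._0002.pool.root"])
job.on_ddm_callback("EVNT._0001.pool.root")
job.on_ddm_callback("EVNT._0002.pool.root")
assert job.status == "activated"
```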

Slide 17: Pilot and Autopilot (1/2)
- Autopilot is a scheduler that submits pilots to sites via condor-g/glidein
  - Pilot → Gatekeeper; Job → Panda server
- Pilots are scheduled to the site batch system and pull jobs as soon as CPUs become available
  - Panda server → Job → Pilot
- Pilot submission and job submission are different: the job is the payload for the pilot
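A hedged sketch of the Autopilot idea follows: keep a minimum number of pilots queued at each site by submitting Condor-G jobs that run the pilot wrapper. The site list, gatekeeper name, queue-depth query and submit template below are illustrative only, not the actual Autopilot configuration.

```python
# Illustrative Autopilot-style loop: top up the pilot queue at each site by
# submitting a Condor-G (grid universe) job that runs the pilot wrapper.
import subprocess
import tempfile

SITES = {"EXAMPLE_T2": "gatekeeper.example.edu/jobmanager-pbs"}  # hypothetical
MIN_QUEUED_PILOTS = 20

SUBMIT_TEMPLATE = """universe = grid
grid_resource = gt2 {gatekeeper}
executable = pilot.sh
output = pilot.$(Cluster).out
error  = pilot.$(Cluster).err
queue {count}
"""

def queued_pilots(site):
    # Placeholder: a real scheduler would query condor_q / the site batch system.
    return 0

def submit_pilots(gatekeeper, count):
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as sub:
        sub.write(SUBMIT_TEMPLATE.format(gatekeeper=gatekeeper, count=count))
        path = sub.name
    subprocess.run(["condor_submit", path], check=True)

for site, gatekeeper in SITES.items():
    missing = MIN_QUEUED_PILOTS - queued_pilots(site)
    if missing > 0:
        submit_pilots(gatekeeper, missing)
```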

Slide 18: Pilot and Autopilot (2/2)
- How the pilot works
  - Sends several parameters to the Panda server for job matching (HTTP request): CPU speed, available memory on the WN, list of ATLAS releases available at the site
  - Retrieves an 'activated' job (HTTP response to the above request); the job goes activated → running
  - Runs the job immediately, because all input files should already be available at the site
  - Sends a heartbeat every 30 min
  - Copies output files to the local SE and registers them in the Local Replica Catalogue
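A compressed sketch of that pilot behaviour, under stated assumptions: the parameter names, payload execution, stage-out and catalogue registration are stubs invented for illustration; the overall sequence and the 30-minute heartbeat follow the slide.

```python
# Hedged pilot sketch: advertise node resources, pull an activated job,
# run it, heartbeat every 30 minutes, then stage out and register outputs.
import subprocess
import threading

def get_node_parameters():
    # Illustrative only: the real pilot measures CPU speed, WN memory and
    # installed ATLAS releases and sends them for job matching.
    return {"cpu": 2000, "mem": 2048, "releases": ["13.0.40", "14.1.0"]}

def request_job(params):      # placeholder for the HTTPS getJob call
    return {"pandaid": 1234, "transformation": "echo", "args": ["hello"]}

def send_heartbeat(pandaid):  # placeholder for the HTTPS status update
    print(f"heartbeat for job {pandaid}")

def stage_out_and_register(pandaid, outputs):
    # Placeholder: copy outputs to the local SE and register them in the LRC/LFC.
    pass

def run_pilot():
    job = request_job(get_node_parameters())
    if job is None:
        return
    stop = threading.Event()
    def heartbeat():
        while not stop.wait(30 * 60):            # every 30 min, as on the slide
            send_heartbeat(job["pandaid"])
    threading.Thread(target=heartbeat, daemon=True).start()
    # Inputs are expected to be at the site already, so run the payload at once.
    subprocess.run([job["transformation"], *job["args"]], check=False)
    stop.set()
    stage_out_and_register(job["pandaid"], outputs=[])

if __name__ == "__main__":
    run_pilot()
```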

Slide 19: Production vs Analysis
- Run on the same infrastructure
  - Same software, monitoring system and facilities
  - No duplicated manpower for maintenance
- Separate computing resources
  - Different queues map to different CPU clusters, so production and analysis don't have to compete with each other
- Different policies for data transfers
  - Analysis jobs don't trigger data transfer: jobs go to sites which hold the input files
  - For production, input files are dispatched to T2s and output files are aggregated to the T1 via DDM asynchronously → controlled traffic

Slide 20: Current PanDA production – Past Week

Slide 21: PanDA production – Past Month

Slide 22: MC Production

Slide 23: ATLAS Data Management Software – Don Quijote
- The second generation of the ATLAS DDM system (DQ2)
  - DQ2 developers: M. Branco, D. Cameron, T. Maeno, P. Salgado, T. Wenaus, …
  - The initial idea and architecture were proposed by M. Branco and T. Wenaus
- DQ2 is built on top of Grid data transfer tools
- Moved to a dataset-based approach
  - A dataset is an aggregation of files plus associated DDM metadata
  - Datasets are the unit of storage and replication
- Automatic data transfer mechanisms using distributed site services
  - Subscription system
  - Notification system
- Current version: 1.0
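To make the dataset notion concrete, here is a small data-model sketch. The field names and the example dataset name are invented for illustration and are not the DQ2 catalogue schema; the point is only that a dataset is a named aggregation of files plus metadata, and that it is the unit that gets replicated.

```python
# Hedged sketch of the dataset-based model: files are grouped into a named
# dataset, and replication/subscription act on the dataset as a whole.
from dataclasses import dataclass, field

@dataclass
class DatasetFile:
    lfn: str            # logical file name
    guid: str
    size_bytes: int

@dataclass
class Dataset:
    name: str
    files: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    frozen: bool = False        # a frozen dataset receives no new versions

    def add_file(self, f: DatasetFile):
        if self.frozen:
            raise ValueError(f"dataset {self.name} is frozen")
        self.files.append(f)

# illustrative dataset and file names
ds = Dataset("mc08.105001.pythia_minbias.evgen.EVNT.e357")
ds.add_file(DatasetFile("EVNT._0001.pool.root", "guid-0001", 123456789))
```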

Slide 24: DDM components (diagram)
- DQ2 dataset catalog
- DQ2 "Queued Transfers"
- Local File Catalogs
- File Transfer Service
- DQ2 Subscription Agents
- DDM end-user tools (T. Maeno, BNL): dq2_ls, dq2_get, dq2_cr

Slide 25: DDM Operations Model
- Topology diagram: CERN plus the Tier-1s (LYON, NG, BNL, FZK, RAL, CNAF, PIC, TRIUMF, SARA, ASGC) and their associated Tier-2/Tier-3 sites, e.g. lapp, lpc, Tokyo, Beijing, Romania, grif in the LYON cloud; SWT2, GLT2, NET2, WT2, MWT2, WISC in the BNL cloud; TWT2, Melbourne in the ASGC cloud
- Each site runs a VO box, a dedicated computer to run the DDM services
- US ATLAS DDM operations team: BNL H. Ito, W. Deng, A. Klimentov, P. Nevski; GLT2 S. McKee (MU); MWT2 C. Waldman (UC); NET2 S. Youssef (BU); SWT2 P. McGuigan (UTA); WT2 Y. Wei (SLAC); WISC X. Neng (WISC)
- T1-T1 and T1-T2 associations follow the ATLAS tier associations
  - All Tier-1s have a predefined (software) channel with CERN and with each other
  - Tier-2s are associated with one Tier-1 and form a cloud; Tier-2s have a predefined channel with the parent Tier-1 only
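The channel rules on this slide (every Tier-1 talks to CERN and to the other Tier-1s, a Tier-2 only to its parent Tier-1) can be captured in a few lines. The cloud dictionary below is a small, illustrative subset, not the full 2008 topology.

```python
# Hedged sketch of the ATLAS cloud/channel rules; site lists are illustrative.
CLOUDS = {
    "LYON": {"tier1": "LYON", "tier2s": {"lapp", "lpc", "Tokyo", "Beijing"}},
    "BNL":  {"tier1": "BNL",  "tier2s": {"SWT2", "GLT2", "NET2", "WT2", "MWT2"}},
    "ASGC": {"tier1": "ASGC", "tier2s": {"TWT2", "Melbourne"}},
}
TIER1S = {c["tier1"] for c in CLOUDS.values()} | {"CERN"}

def channel_allowed(src, dst):
    """True if a predefined transfer channel exists between two sites."""
    if src in TIER1S and dst in TIER1S:
        return True                       # T1-T1 (and T1-CERN) channels exist
    for cloud in CLOUDS.values():
        if src in cloud["tier2s"]:
            return dst == cloud["tier1"]  # a T2 only talks to its parent T1
        if dst in cloud["tier2s"]:
            return src == cloud["tier1"]
    return False

assert channel_allowed("BNL", "CERN")
assert channel_allowed("SWT2", "BNL")
assert not channel_allowed("SWT2", "LYON")
```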

Slide 26: Activities – Data Replication
- Centralized and automatic (according to the computing model)
- Simulated data
  - AOD/NTUP/TAG (current data volume ~1.5 TB/week)
  - BNL holds a complete set of dataset replicas
  - US Tier-2s define what fraction of the data they will keep, from 30% to 100%
- Validation samples
  - Replicated to BNL for SW validation purposes
- Critical data replication
  - Database releases: replicated to BNL from CERN and then from BNL to the US ATLAS T2s; the data volume is relatively small (~100 MB)
  - Conditions data: replicated to BNL from CERN
- Cosmic data
  - BNL requested 100% of the cosmic data
  - Data replicated from CERN to BNL and on to the US Tier-2s
- Data replication for individual groups, universities, physicists
  - A dedicated Web interface is set up

Slide 27: Data Replication to Tier 2's

Slide 28: You'll never walk alone – Weekly Throughput: 2.1 GB/s out of CERN (plot from Simone Campana)

Slide 29: Subscriptions
- Subscription: a request for the full replication of a dataset (or dataset version) at a given site
  - Requests are collected by the centralized subscription catalog
  - And are then served by a set of agents – the site services
- Subscription on a dataset version: one-time-only replication
- Subscription on a dataset: replication is triggered on every new version detected; the subscription is closed when the dataset is frozen
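A sketch of those subscription semantics, using invented class and method names rather than the DQ2 API: a version subscription is served once, while a dataset subscription re-triggers on every new version and closes once the dataset is frozen and fully served.

```python
# Hedged sketch of subscription resolution; just the two rules from the slide
# expressed as a small state machine.
class Subscription:
    def __init__(self, dataset, site, on_version=None):
        self.dataset = dataset            # dataset name
        self.site = site                  # destination site
        self.on_version = on_version      # set -> one-shot version subscription
        self.served_versions = set()
        self.closed = False

    def versions_to_replicate(self, known_versions, frozen):
        """Return the versions that still need replication for this subscription."""
        if self.closed:
            return []
        if self.on_version is not None:               # dataset-version subscription
            if self.on_version in self.served_versions:
                self.closed = True                    # one-time-only replication
                return []
            return [self.on_version]
        todo = [v for v in known_versions if v not in self.served_versions]
        if frozen and not todo:
            self.closed = True     # dataset frozen and fully served -> close
        return todo

    def mark_served(self, version):
        self.served_versions.add(version)
```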

Slide 30: Site Services
- Agent-based framework; goal: satisfy subscriptions
- Each agent serves a specific part of a request
  - Fetcher: fetches new subscriptions from the subscription catalog
  - Subscription Resolver: checks whether a subscription is still active, for new dataset versions, new files to transfer, …
  - Splitter: creates smaller chunks from the initial requests, identifies files requiring transfer
  - Replica Resolver: selects a valid replica to use as source
  - Partitioner: creates chunks of files to be submitted as a single request to the FTS
  - Submitter/PendingHandler: submits/manages the FTS requests
  - Verifier: checks the validity of files at the destination
  - Replica Register: registers new replicas in the local replica catalog
  - …
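The agent chain above is essentially a pipeline over a work item (a subscription request). The sketch below wires up a few of the agents as plain functions with placeholder bodies to show how a request would flow from Fetcher to Replica Register; none of the names, signatures or data come from the actual site services code.

```python
# Hedged sketch of the agent-based site services pipeline: each stage handles
# one part of satisfying a subscription. All bodies are placeholders.
def fetcher():
    """Fetch new subscriptions from the subscription catalog (placeholder)."""
    return [{"dataset": "some.dataset", "site": "EXAMPLE_T2"}]

def splitter(request):
    """Identify which files of the dataset still need transfer (placeholder)."""
    request["files_to_transfer"] = ["file1", "file2"]
    return request

def replica_resolver(request):
    """Pick a valid source replica for each file (placeholder)."""
    request["sources"] = {f: "SOURCE_T1" for f in request["files_to_transfer"]}
    return request

def submitter(request):
    """Submit one FTS request per chunk of files (placeholder)."""
    print(f"FTS transfer of {len(request['files_to_transfer'])} files "
          f"to {request['site']}")
    return request

def replica_register(request):
    """Register the new replicas in the local replica catalog (placeholder)."""
    return request

PIPELINE = [splitter, replica_resolver, submitter, replica_register]

for req in fetcher():
    for agent in PIPELINE:
        req = agent(req)
```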

Slide 31: Typical deployment (diagram: central catalogs plus several site services instances)
- Deployment at the Tier-0 is similar to the Tier-1s
- LFC and FTS services at the Tier-1s
- SRM services at every site, including Tier-2s

Slide 32: Interaction with the grid middleware
- File Transfer Service (FTS)
  - One deployed per Tier-0 / Tier-1 (matches the typical site services deployment)
  - Triggers the third-party transfer by contacting the SRMs; needs to be constantly monitored
- LCG File Catalog (LFC)
  - One deployed per Tier-0 / Tier-1 (matches the typical site services deployment)
  - Keeps track of local file replicas at a site
  - Currently used as the main source of replica information by the site services
- Storage Resource Manager (SRM)
  - Once pre-staging comes into the picture

Slide 33: DDM – Current Issues and Plans
- Dataset deletion
  - Non-trivial, although critical
  - First implementation uses a central request repository
  - Being integrated into the site services
- Dataset consistency
  - Between storage and the local replica catalogs
  - Between the local replica catalogs and the central catalogs
  - A lot of effort has gone into this recently – tracker, consistency service
- Prestaging of data
  - Currently done just before file movement, which introduces high latency when a file is on tape
- Messaging
  - More asynchronous flow (less polling)

Slide 34: ADC Operations Shifts
- ATLAS Distributed Computing Operations Shifts (ADCoS)
  - World-wide shifts
  - To monitor all ATLAS distributed computing resources
  - To provide Quality of Service (QoS) for all data processing
  - Shifters receive official ATLAS service credit (OTSMoU)
- Additional information

Slide 35: Typical Shift Plan
- Browse recent shift history
- Check performance of all sites
- File tickets for new issues
- Continue interactions about old issues
- Check status of current tasks
- Check all central processing tasks
- Monitor analysis flow (not individual tasks)
- Overall data movement
- File software (validation) bug reports
- Check Panda, DDM health
- Maintain elog of shift activities

Slide 36: Shift Structure
- Shifter on call
  - Two consecutive days
  - Monitor – escalate – follow up
  - Basic manual interventions (site on/off)
- Expert on call
  - One week duration
  - Global monitoring
  - Advises the shifter on call
  - Major interventions (service on/off)
  - Interacts with other ADC operations teams
  - Provides feedback to the ADC development teams
- Tier 1 expert on call
  - Very important (e.g. Rod Walker, Graeme Stewart, Eric Lancon…)

Slide 37: Shift Structure (schematic by Xavier Espinal)

Slide 38: ADC Inter-relations (diagram: ADCoS and the ADC activity coordinators)
- Central Services: Birger Koblitz
- Operations Support: Pavel Nevski
- Tier 0: Armin Nairz
- DDM: Stephane Jezequel
- Distributed Analysis: Dietrich Liko
- Tier 1 / Tier 2: Simone Campana
- Production: Alex Read
