Job Life Cycle Management Libraries for CMS Workflow Management Projects. Stuart Wakefield, on behalf of the CMS DMWM group. Thanks to Frank van Lingen for the slides.


Job Life Cycle Management Libraries for CMS Workflow Management Projects. Stuart Wakefield, on behalf of the CMS DMWM group. Thanks to Frank van Lingen for the slides.

Motivation

- Converge on cross-project common components: uniform usage, lower maintenance.
- Prevent repeated implementation of the same functionality.
- Address performance bottlenecks (e.g. database issues).
- Provide developers with sufficient tools that they can focus on the (physics) domain-specific part of their development.

Architecture

Common low-level / API layer (WMCore):
- Grid/storage interaction: LCG, OSG, ARC, etc.
- CMS services: authentication, databases, site info...

Event-driven components (WMAgent):
- Generic component harness
- Common library of components

[Diagram: specialised WMAgent implementations (T0, ProdAgent, CRAB) built on WMAgent and the WMCore common libraries.]

Structure of an Agent

[Diagram of the agent structure; only the top layer is labelled "Component specific".]

CMS Workflows: 3* layers (*Tier0 does not have a request layer).

Job Life Cycle Management

- Different components based on WMCore handle the various states of a job: create, submit, track, etc. The components involved with a job depend on its state.
- There may be multiple types of jobs, so components need to differentiate between job types.
- Components can interact with third-party services: site DB, site submission, mass storage, etc.
- An application (e.g. CRAB, T0, Production) is a collection of components managing the life cycle, not necessarily the same components in each application.

Life cycles of job (types)

[Diagram: each job type has its own sequence of states. Job type 1: Create, Submit, Track, Register DBS, Register PhEDEx, Cleanup. Job type n: Create, Submit, Track, Cleanup. Components (Job Creator, Job Submitter, Job Tracker, Cleanup) represent the state operations, communicate through messages (CreateJob, SubmitJob, TrackJob, JobSuccess), and synchronize between parallel states.]

Simplified example! There are many more states (Error, Queued, Retry...).
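The message-driven life cycle above can be sketched as follows. This is an illustrative toy, not the actual WMCore API: the class names, message names, and in-memory bus are invented for the sketch; the real system delivers messages through a persistent, database-backed message service.

```python
# Illustrative sketch of message-driven job state handling (not the real
# WMCore API): each component subscribes to one message type, performs its
# state operation, and publishes the message that drives the next state.
from collections import defaultdict, deque

class MessageBus:
    """Minimal in-memory stand-in for the persistent message service."""
    def __init__(self):
        self.subscribers = defaultdict(list)   # message name -> components
        self.queue = deque()

    def subscribe(self, message, component):
        self.subscribers[message].append(component)

    def publish(self, message, payload):
        self.queue.append((message, payload))

    def deliver_all(self):
        while self.queue:
            message, payload = self.queue.popleft()
            for component in self.subscribers[message]:
                component.handle(message, payload)

class JobSubmitter:
    """On SubmitJob, submit the job and request tracking."""
    def __init__(self, bus):
        self.bus = bus
        bus.subscribe("SubmitJob", self)

    def handle(self, message, job):
        job["state"] = "submitted"
        self.bus.publish("TrackJob", job)

class JobTracker:
    """On TrackJob, poll the job and report success."""
    def __init__(self, bus):
        self.bus = bus
        bus.subscribe("TrackJob", self)

    def handle(self, message, job):
        job["state"] = "done"
        self.bus.publish("JobSuccess", job)

bus = MessageBus()
JobSubmitter(bus)
JobTracker(bus)
job = {"id": 1, "state": "created"}
bus.publish("SubmitJob", job)
bus.deliver_all()
print(job["state"])  # -> done
```

Because components only exchange messages, a different job type simply wires up a different set of components, which is exactly how the two job types in the diagram share Create/Submit/Track but differ in the rest.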

Overview & example components

[Diagram: components (Create, Submit, Track, Error Handling, Register, Merge, Cleanup) built on a common Harness, with core services (MsgService, Trigger, ThreadPool, Database, WMBS). A JobSpec goes out to the site and a FwkJobReport comes back.]

WMCore provides common components without being context- or project-specific (e.g. CRAB, T0, Production). Some components work on jobs in sequence, others in parallel.
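The generic component harness can be sketched as below. Again this is illustrative, not the real WMCore Harness API: the harness owns the generic machinery (here, message dispatch), and a concrete component only supplies its component-specific handlers.

```python
# Sketch of a generic component harness (illustrative, not the WMCore API):
# the harness provides the shared machinery, and a concrete component only
# registers handlers for the messages it cares about.

class Harness:
    """Generic machinery shared by every component."""
    def __init__(self, name):
        self.name = name
        self.handlers = {}

    def register_handler(self, message, handler):
        self.handlers[message] = handler

    def dispatch(self, message, payload):
        # Messages this component has no handler for are simply ignored.
        handler = self.handlers.get(message)
        if handler is not None:
            return handler(payload)

class ErrorHandler(Harness):
    """Concrete component: only the handlers are component-specific."""
    def __init__(self):
        super().__init__("ErrorHandler")
        self.register_handler("JobFailed", self.handle_failure)

    def handle_failure(self, job):
        job["retries"] = job.get("retries", 0) + 1
        return "RetryJob" if job["retries"] < 3 else "JobExhausted"

component = ErrorHandler()
job = {"id": 7}
print(component.dispatch("JobFailed", job))  # -> RetryJob
```

The design point is that every box in the diagram (Create, Submit, Track, Error Handling, ...) shares the same harness skeleton, so only the handler bodies differ between components.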

Msg Service: delivery of asynchronous messages

[Diagram: msg_queue flanked by buffer_in and buffer_out tables, plus core msg metadata (e.g. subscriptions).]

Buffer tables prevent single inserts into, and deletes from, one large table; the buffers are purged/filled when a certain size is reached. But there is still a problem when one component is 'dead' or 'stuck' while others have messages flowing through buffer_in, msg_queue, and buffer_out: messages for the dead component accumulate in msg_queue. Solution (or option): give each component its own buffer_in, msg_queue, and buffer_out.

[Diagram: core msg metadata (e.g. subscriptions) plus per-component queues: msg_queue_component1 ... msg_queue_componentN.]

- Messages are distributed over more tables (prevents large tables).
- Softens the impact of a 'dead' component.
- Table-name pre/postfixing prevents table-name clashes.

The current transport implementation is based on inserting a message into a database. This transport mechanism can be replaced while still using the rest of the persistent backend (~90%), including the buffering outlined here, to store the messages and ensure no messages are lost. An example of such an alternative transport layer is Twisted.
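A minimal sketch of the per-component queue idea, using the stdlib sqlite3 module. The schema and table naming are invented for illustration (the real WMCore tables differ), but it shows the two points above: a name prefix avoids table clashes, and a stuck component only fills its own table.

```python
# Sketch of per-component message queues (illustrative schema, not WMCore's):
# each component gets its own queue table, named with a prefix to avoid
# clashes, so messages for a stuck component accumulate only in its table.
import sqlite3

def queue_table(component):
    # Prefixing prevents table-name clashes between components.
    return "msg_queue_%s" % component

def publish(conn, component, message):
    table = queue_table(component)
    conn.execute("CREATE TABLE IF NOT EXISTS %s (message TEXT)" % table)
    conn.execute("INSERT INTO %s (message) VALUES (?)" % table, (message,))

def consume(conn, component):
    """Drain and return all pending messages for one component."""
    table = queue_table(component)
    rows = conn.execute(
        "SELECT message FROM %s ORDER BY rowid" % table).fetchall()
    conn.execute("DELETE FROM %s" % table)
    return [m for (m,) in rows]

conn = sqlite3.connect(":memory:")
publish(conn, "JobSubmitter", "SubmitJob:1")
publish(conn, "JobSubmitter", "SubmitJob:2")
publish(conn, "JobTracker", "TrackJob:1")   # a stuck tracker fills only its own table
print(consume(conn, "JobSubmitter"))  # -> ['SubmitJob:1', 'SubmitJob:2']
```

Swapping the transport (e.g. for Twisted) would replace only `publish`/`consume` delivery; the persistent tables that guarantee no message is lost stay the same.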

Other Core Services/Libraries

- (Persistent) ThreadPool and worker threads: long-running threads within a component.
- Trigger: synchronization of components.
- Database connection management: through SQLAlchemy.
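WMCore delegates connection management to SQLAlchemy. As a stdlib-only illustration of what pooled connection management buys you, here is a toy pool (entirely invented for this sketch): connections are opened once and reused, and threads block rather than exhausting the database.

```python
# Toy connection pool (stdlib-only illustration; WMCore itself relies on
# SQLAlchemy's pooling): connections are reused instead of reopened, and are
# handed out/returned through a thread-safe queue.
import queue
import sqlite3

class ConnectionPool:
    def __init__(self, database, size=2):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(sqlite3.connect(database, check_same_thread=False))

    def acquire(self):
        return self._pool.get()          # blocks if all connections are in use

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(":memory:", size=2)
conn = pool.acquire()
result = conn.execute("SELECT 1 + 1").fetchone()[0]
pool.release(conn)
print(result)  # -> 2
```

Centralizing this (in WMCore's case, behind SQLAlchemy) is what lets the same component code run against MySQL, SQLite, and Oracle.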

Other Core Services/Libraries (cont.)

- Web development (HTTPFrontend): facilitates development of web-based components, based on CherryPy.
- WMBS data model: manages the relation between workflow, job, and data products.

The goal: provide developers with sufficient tools that they can focus on the (physics) domain-specific part of their development.

Workflow Management Bookkeeping System (WMBS)

- Provides a generalized processing framework; the current system was designed for production, not processing.
- Subscription = workflow + fileset.
- Automate as much as possible: jobs are created when new data appear in a fileset; subscriptions are created when a new fileset is produced, i.e. new runs are taken.
- A workflow defines how jobs are created from data.

[Diagram: data model relating Fileset, Workflow, Subscription, Job, Output Files, and File Details (input files).]
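The "subscription = workflow + fileset" relation can be sketched as below. The class names mirror the slide's diagram, but all fields and methods are invented for illustration; the real WMBS is a relational schema, not these in-memory objects.

```python
# Illustrative sketch of the WMBS relation (invented fields/methods, not the
# real WMBS schema): a subscription ties a workflow to a fileset, and jobs
# are created automatically as new files appear in that fileset.

class Fileset:
    def __init__(self, name):
        self.name = name
        self.files = []

    def add_file(self, lfn):
        self.files.append(lfn)

class Workflow:
    """Defines how jobs are created from data: here, N input files per job."""
    def __init__(self, name, files_per_job=2):
        self.name = name
        self.files_per_job = files_per_job

class Subscription:
    """Subscription = workflow + fileset."""
    def __init__(self, workflow, fileset):
        self.workflow = workflow
        self.fileset = fileset
        self.acquired = 0                 # files already turned into jobs

    def make_jobs(self):
        """Create jobs for files that arrived since the last call."""
        jobs = []
        n = self.workflow.files_per_job
        available = self.fileset.files[self.acquired:]
        for i in range(0, len(available) - len(available) % n, n):
            jobs.append(available[i:i + n])
        self.acquired += len(jobs) * n
        return jobs

fileset = Fileset("Run2008A")
sub = Subscription(Workflow("reco", files_per_job=2), fileset)
fileset.add_file("file1.root")
fileset.add_file("file2.root")
fileset.add_file("file3.root")
print(sub.make_jobs())  # -> [['file1.root', 'file2.root']]
```

Note that `file3.root` is held back until a second file arrives: that is the automation point of the slide, where job creation follows data arrival rather than a manual request.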

Development

- Small team + tight schedule; "sprints" are used to make rapid progress.
- Emphasis on code style, quality, testing, etc.
- Test reports are produced periodically: tests run on MySQL, SQLite and Oracle (not all developers have easy access to all architectures); developers with failures are named and shamed; the author is determined from CVS.

[Diagram of the test cycle: run test_generate; edit the generated files (e.g. change output log files and the mapping from developer to modules); run test_code and test_style against conf_test_mysql.py / conf_test_oracle.py, producing failure reports (failures1.rep, failures2_mysql.rep, failures2_oracle.rep, failures3_mysql.rep, failures3_oracle.rep); the CVS log maps failures to authors. The cycle repeats (e.g. daily/weekly); the test template files are updated periodically (e.g. once per month).]

Skeleton Code Generation

- Existing components are parsed to generate stubs for new-style components; authors then fill in the blanks (handlers etc.) or rewrite as necessary.
- New (skeleton) components can be generated from a simple specification.
- The heavy lifting is taken care of, leaving the author to concentrate on the task at hand.

(Workflow) Code Generation

A workflow can be visualized: components & messages.

    synchronizer = {'ID': 'JobPostProcess',
                    'action': 'PA.Core.Trigger.PrepareCleanup'}

    handler = {'messageIn': 'SubmitJob',
               'messageOut': 'TrackJob|JobSubmitFailed',
               'component': 'JobSubmitter',
               'threading': 'yes',
               'createSynchronizer': 'JobPostProcess'}

The synchronizer defines a Trigger for component synchronization. The handler defines, in a workflow, a component that acts on messageIn messages and produces messageOut messages; 'threading': 'yes' means the handling of messages is threaded.
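A sketch of how such a specification could drive skeleton generation. The template and `generate_handler` function are invented for illustration; the actual DMWM generator works differently, but the idea of expanding a handler spec into fill-in-the-blanks code is the same.

```python
# Illustrative generator (invented, not the actual DMWM tool): expand a
# handler specification into a skeleton handler class for the author to
# fill in.

HANDLER_TEMPLATE = '''class {component}Handler:
    """Acts on {messageIn} messages, produces {messageOut} messages."""
    threaded = {threaded}

    def __call__(self, payload):
        # TODO: fill in the component-specific logic.
        raise NotImplementedError
'''

def generate_handler(spec):
    return HANDLER_TEMPLATE.format(
        component=spec['component'],
        messageIn=spec['messageIn'],
        messageOut=spec['messageOut'],
        threaded=spec['threading'] == 'yes',
    )

handler = {'messageIn': 'SubmitJob',
           'messageOut': 'TrackJob|JobSubmitFailed',
           'component': 'JobSubmitter',
           'threading': 'yes',
           'createSynchronizer': 'JobPostProcess'}

source = generate_handler(handler)
print(source.splitlines()[0])  # -> class JobSubmitterHandler:
```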

Conclusion

- CMS distributed projects are moving to a common codebase: library functionality (grid interaction etc.) and common component functionality.
- Taking the opportunity to refactor a lot of the existing code and improve testing etc.
- Provide common data processing functionality.
- The schedule is aggressive, but the aim is reduced maintenance cost in the future.