
Slide 1: DIRAC project
A. Tsaregorodtsev, CPPM, Marseille
DIRAC review panel meeting, 15 November 2005, CERN

Slide 2: Outline
- DIRAC brief history
- Goals and scope
- Architecture and components
- Implementation technologies
- Scale of the system
- Project organization

Slide 3: DIRAC brief history
DIRAC – Distributed Infrastructure with Remote Agent Control
- DIRAC project started in September 2002
- First production in the fall of 2002
- PDC1 in March-May 2003 was the first successful massive production run
- Complete rewrite of DIRAC by DC2004 in May 2004, with incorporation of LCG resources (DIRAC2)
- DIRAC extended for distributed analysis tasks in autumn 2005
- DIRAC review

Slide 4: Production with DataGrid (Dec 2002)
[Diagram by Eric van Herwijnen: the Production Manager uses workflow and production editors to instantiate workflows in the Production DB; the Production Server sends job requests with an input sandbox (job + ProdAgent) to a DataGRID CE, where a DataGRID Agent runs the production scripts and returns status updates, production data and bookkeeping updates.]

Slide 5: Goals and scope
Provide the LHCb Collaboration with a robust platform to:
- Run data productions on all the resources available to LHCb (PCs, clusters, grids)
- Distribute LHCb data in real time according to the Computing Model
- Offer a well controlled environment for running User Analysis efficiently on the grid
- Provide an efficient system for steering, monitoring and accounting of all LHCb activities on the grid and other distributed resources

Slide 6: DIRAC design principles (1)
- Light implementation
  - Must be easy to deploy on various platforms
  - Non-intrusive: no root privileges, no dedicated machines on sites
  - Must be easy to configure, maintain and operate
- Minimization of human intervention
  - Should run autonomously once installed and configured
- Platform neutral
  - At least across the various Linux flavors
  - Porting of the DIRAC agent to Windows was demonstrated

Slide 7: DIRAC design principles (2)
- Use standard components and third-party developments as much as possible
- High level of adaptability
  - There will always be resources outside the LCG domain: sites that cannot afford LCG, desktops, …
  - We have to use them all in a consistent way
- Modular design at each level
  - New functionality can be added easily

Slide 8: DIRAC architecture and components

Slide 9: DIRAC Services and Resources
[Diagram: DIRAC services (Job Management, Job Monitoring, Job Accounting with its AccountingDB, Configuration, FileCatalog and Bookkeeping services) accessed by the production manager, the GANGA UI, a user CLI, a job monitor and the BK query / FileCatalog browser web pages; DIRAC resources comprise DIRAC sites with agents and DIRAC CEs, DIRAC Storage (disk files, gridftp) and LCG (a Resource Broker feeding CEs on which agents run).]

Slide 10: Services and agents
- Services are passive components which respond to incoming requests from their clients
  - Need inbound connectivity
  - Run on stable servers at CERN, Barcelona and Marseille
  - Under the control of their respective managers
- Agents are light, easy-to-deploy components which animate the whole system by sending requests to resources and services
  - Need only outbound connectivity, to well defined URLs
  - Only agents run on production sites
    - Non-intrusive: running in user space
    - More secure by nature, (almost) no problems with firewalls

Slide 11: Agents: Site agents
- Two kinds of agents
  - Site agents
  - Pilot agents
- Site agents
  - Usually run on the site gatekeeper hosts
  - Deployed and updated by human intervention
  - Stable, running as daemon processes
  - Serve various purposes: job steering on the local cluster, data management on the local SE, bookkeeping of the jobs

Slide 12: Agents: Pilot agents
- Pilot agents
  - Run on the worker node, reserving it for immediate use by DIRAC
  - Steer job execution and post-job operations (data upload, bookkeeping)
  - Perform workload management on the PC that they own
- Pilot agents form an overlay network which hides the heterogeneity of the underlying resources and provides immediate access to them
- With pilot agents, the DIRAC Task Queue is the only waiting queue in the system
  - This is the only way the LHCb VO can impose its policies by prioritizing production and user jobs

Slide 13: DIRAC: Agent modular design
[Diagram: an Agent Container with pluggable modules – JobAgent, PendingJobAgent, BookkeepingAgent, TransferAgent, MonitorAgent, CustomAgent.]
- An agent is a container of pluggable modules
  - Modules can be added dynamically
- Several agents can run on the same site
  - Equipped with different sets of modules, as defined in their configuration
- Data management is based on specialized agents running on the DIRAC sites
A sketch of the container pattern follows.
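
As a concrete illustration of the container pattern, here is a minimal Python sketch (Python being DIRAC's implementation language). The class and module names mirror the slide; the method names and bodies are illustrative assumptions, not the real DIRAC interfaces.

    class AgentModule:
        """Base class for pluggable agent modules (illustrative)."""
        def execute(self):
            raise NotImplementedError

    class JobAgent(AgentModule):
        def execute(self):
            print("JobAgent: looking for work for the local resource")

    class TransferAgent(AgentModule):
        def execute(self):
            print("TransferAgent: processing pending transfer requests")

    class AgentContainer:
        """Runs whatever modules the site configuration lists."""
        def __init__(self):
            self.modules = []

        def add_module(self, module):
            # Modules can be plugged in dynamically, per site configuration.
            self.modules.append(module)

        def run_once(self):
            for module in self.modules:
                module.execute()

    agent = AgentContainer()
    agent.add_module(JobAgent())
    agent.add_module(TransferAgent())
    agent.run_once()   # a real agent loops, sleeping between cycles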

Slide 14: DIRAC workload management
- The DIRAC WMS consists of the following main components:
  - A central Job database and Task Queues
  - Job agents running on sites and on the worker nodes
  - Job wrappers, generated by the job agents from templates providing job-specific data

Slide 15: DIRAC workload management (2)
- Realizes the PULL scheduling paradigm (sketched below)
  - Agents request jobs whenever the corresponding resource is free
  - Condor ClassAds and a Matchmaker are used to find jobs suited to the resource profile
- Agents steer job execution on the site
- Jobs report their state and environment to the central Job Monitoring service
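
A minimal sketch of the pull model: the real system matches Condor ClassAds through the Matchmaker service; the dictionary-based matching below is an illustrative stand-in with made-up job and site names.

    # Central Task Queue: jobs waiting with their requirements.
    task_queue = [
        {"JobID": 101, "Site": "LCG.CERN.ch", "MaxCPUTime": 3600},
        {"JobID": 102, "Site": "DIRAC.Marseille.fr", "MaxCPUTime": 86400},
    ]

    def match_job(resource):
        """Return the first queued job that fits the free resource's profile."""
        for job in task_queue:
            if job["Site"] == resource["Site"] and job["MaxCPUTime"] <= resource["CPUTimeLeft"]:
                return job
        return None

    # An agent with a free resource pulls work rather than having it pushed:
    resource = {"Site": "LCG.CERN.ch", "CPUTimeLeft": 7200}
    print(match_job(resource))   # -> job 101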

Slide 16: DIRAC workload management (3)
- Job wrappers (life cycle sketched below):
  - Download the input sandbox
  - Provide access to the input data by generating an appropriate Pool XML slice
  - Invoke the job application
  - Run a watchdog providing heart-beats for the Job Monitoring Service
  - Collect the job execution environment and consumption parameters and pass them to the Job Monitoring Service
  - Upload the output sandbox
  - Upload the output data
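
The wrapper's life cycle condenses to the following sketch. The report() stand-in and function names are assumptions; the real wrapper is generated from a template and talks to the Job Monitoring Service.

    import subprocess
    import threading

    def report(status):
        print("-> JobMonitoringSvc:", status)   # stand-in for the real service call

    def heartbeat(stop_event, interval=300):
        # Watchdog thread: periodic heart-beats while the application runs.
        while not stop_event.wait(interval):
            report("heartbeat + consumption parameters")

    def run_job(command):
        report("downloading input sandbox")
        report("generating Pool XML slice for input data access")
        stop = threading.Event()
        threading.Thread(target=heartbeat, args=(stop,), daemon=True).start()
        report("invoking application")
        subprocess.run(command, check=False)
        stop.set()
        report("uploading output sandbox and output data")

    run_job(["echo", "simulated Gauss application"])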

Slide 17: File Catalogs
- DIRAC incorporated 3 different File Catalogs
  - Replica tables in the LHCb Bookkeeping Database
  - The File Catalog borrowed from the AliEn project – now retired
  - LFC: Python binding of the C client library, proprietary communication protocol
- All the catalogs have identical client APIs
  - They can be used interchangeably (see the sketch below)
  - This was done for redundancy and for gaining experience
- LFC will be retained as the only File Catalog
- Other catalogs
  - Processing database with a File Catalog interface
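
The "identical client API" idea can be sketched as follows: every catalog exposes the same methods, so operations can be sent to all of them for redundancy. Class and method names are illustrative, not the real DIRAC API.

    class BookkeepingReplicaCatalog:
        """Replica tables in the Bookkeeping DB; here an in-memory stand-in."""
        def __init__(self):
            self._replicas = {}
        def addReplica(self, lfn, se, pfn):
            self._replicas.setdefault(lfn, []).append((se, pfn))
        def getReplicas(self, lfn):
            return self._replicas.get(lfn, [])

    class LFCCatalog:
        """Would wrap the LFC Python binding; same interface, same stand-in."""
        def __init__(self):
            self._replicas = {}
        def addReplica(self, lfn, se, pfn):
            self._replicas.setdefault(lfn, []).append((se, pfn))
        def getReplicas(self, lfn):
            return self._replicas.get(lfn, [])

    catalogs = [BookkeepingReplicaCatalog(), LFCCatalog()]
    for catalog in catalogs:   # redundancy: every operation goes to all catalogs
        catalog.addReplica("/lhcb/data/f1.dst", "CERN-disk", "srm://cern.ch/f1.dst")
    print(catalogs[1].getReplicas("/lhcb/data/f1.dst"))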

Slide 18: Data management tools
- The DIRAC Storage Element is a combination of a standard server and a description of its access in the Configuration Service
  - Pluggable transport modules: srm, gridftp, bbftp, sftp, http, …
  - SRM-like functionality for protocol (TURL) resolution (sketched below)
- DIRAC ReplicaManager (API and CLI)
  - get(), copy(), replicate(), register(), etc.
  - Replication uses third-party transfer for gridftp if both ends support it
  - Gets the "best replica", or at least one which is available at the moment of access
  - Deals with multiple catalogs
  - Logs all the performed operations
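
A sketch of the pluggable-transport and TURL-resolution idea: the SE is just a configuration entry, a transport URL is built from it, and the matching transport module is picked by protocol. All names and the configuration layout are assumptions for illustration.

    TRANSPORTS = {}

    def transport(scheme):
        """Register a transport class for a protocol scheme."""
        def register(cls):
            TRANSPORTS[scheme] = cls
            return cls
        return register

    @transport("gridftp")
    class GridFTPTransport:
        def get(self, turl, dest):
            print(f"gridftp: fetch {turl} -> {dest}")

    @transport("http")
    class HTTPTransport:
        def get(self, turl, dest):
            print(f"http: fetch {turl} -> {dest}")

    def resolve_turl(se_config, lfn):
        # SRM-like resolution: build a transport URL from the SE description
        # held in the Configuration Service.
        return f"{se_config['protocol']}://{se_config['host']}{se_config['path']}{lfn}"

    se = {"protocol": "gridftp", "host": "se.cern.ch", "path": "/lhcb"}
    turl = resolve_turl(se, "/prod/file.dst")
    TRANSPORTS[se["protocol"]]().get(turl, "/tmp/file.dst")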

Slide 19: Reliable file transfers
- A Request DB keeps the outstanding transfer requests
- A dedicated agent takes a file transfer request and retries it until it is successfully completed (see the sketch below)
- Third-party transfers, or transfers through a local cache
- The WMS is used for data transfer monitoring
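
The retry pattern itself is simple; a minimal sketch, with an in-memory list standing in for the Request DB and a random failure standing in for a real transfer attempt:

    import random
    import time

    request_db = [
        {"source": "srm://cern.ch/f1", "dest": "srm://cnaf.infn.it/f1", "done": False},
    ]

    def attempt_transfer(request):
        # Stand-in for a fallible third-party or cache-mediated transfer.
        return random.random() > 0.5

    def transfer_agent_cycle():
        for request in request_db:
            if not request["done"]:
                request["done"] = attempt_transfer(request)

    while not all(r["done"] for r in request_db):
        transfer_agent_cycle()
        time.sleep(0.1)   # a real agent polls on a much longer period
    print("all outstanding requests completed")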

Slide 20: File Transfer with FTS
- Start with central Data Movement
  - FTS + TransferAgent + RequestDB
- Explore using local instances of the service at the Tier-1s
  - Load balancing
  - Reliability
  - Flexibility
[Diagram: the LHCb transfer components layered over LCG services.]

Slide 21: Configuration Service
- Provides configuration information for various system components (services, agents, jobs)
- Redundant, with multiple slave servers for load balancing and high availability (client failover is sketched below)
- Automatic slave updates from the master information
- Watchdog to restart a server in case of failure
- Servers currently run at CERN, Marseille and Barcelona
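
On the client side, redundancy can be as simple as trying the servers in turn. A minimal sketch; the server URLs and the getOption method name are assumptions, not the real service interface.

    from xmlrpc.client import ServerProxy

    # Master first, then the slaves (hypothetical addresses).
    SERVERS = [
        "http://dirac-config.cern.ch:9135",
        "http://dirac-config.in2p3.fr:9135",
        "http://dirac-config.ecm.ub.es:9135",
    ]

    def get_option(path):
        """Fetch a configuration option, failing over between servers."""
        for url in SERVERS:
            try:
                return ServerProxy(url).getOption(path)
            except OSError:
                continue   # server unreachable: try the next one
        raise RuntimeError("no Configuration Service server reachable")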

Slide 22: Job Monitoring Service
- Accumulates job status and parameter information as reported by other services, agents or job wrappers
- Provides access to the job status and parameters for various clients
  - Command line
  - Web interface
  - Ganga
- Optimized to serve bulk requests for multiple jobs
  - Fast access to predefined parameters
  - Arbitrary parameters (string key-value pairs) can be published for each job, as sketched below
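
Publishing such parameters might look like the following sketch; setJobParameter and the service URL are hypothetical names, not the verified DIRAC calls.

    from xmlrpc.client import ServerProxy

    def publish_parameters(job_id, params, url="http://dirac-monitoring.cern.ch:9132"):
        """Publish string key-value pairs for one job (names assumed)."""
        service = ServerProxy(url)
        for key, value in params.items():
            service.setJobParameter(job_id, key, str(value))   # everything travels as strings

    # e.g. publish_parameters(12345, {"HostName": "wn042.cern.ch", "CPUConsumed": "1534.2"})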

Slide 23: Job Accounting Service
- After each job completion, a report is sent to the accounting service
- Provides statistical reports by various criteria
  - By period of time, site, production, etc.
- Provides visual representations of the reports, published on a dedicated web page
- Is used for data transfer accounting as well

Slide 24: DIRAC Job structure
- A job consists of one or more steps; it can be split into subjobs corresponding to the steps
- A step consists of one or more modules; it is indivisible with respect to job scheduling
- A module is the smallest unit of execution
  - Standard modules are used in production
  - User-defined modules can be used in analysis
[Diagram: a Production Job made of a Gauss step (SoftwareInstallation, GaussApplication, BookkeepingUpdate modules) and a Boole+Brunel step (SoftwareInstallation, BooleApplication, BrunelApplication, BookkeepingUpdate modules).]
This composition is sketched below.
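
The Job/Step/Module composition maps directly onto code; a sketch with stubbed execution, using the module names from the slide:

    class Module:
        """Smallest unit of execution."""
        def __init__(self, name):
            self.name = name
        def execute(self):
            print("executing module:", self.name)

    class Step:
        """Indivisible with respect to job scheduling."""
        def __init__(self, name, modules):
            self.name, self.modules = name, modules
        def execute(self):
            for module in self.modules:
                module.execute()

    class Job:
        def __init__(self, steps):
            self.steps = steps
        def execute(self):
            for step in self.steps:
                step.execute()

    production_job = Job([
        Step("Gauss", [Module("SoftwareInstallation"),
                       Module("GaussApplication"),
                       Module("BookkeepingUpdate")]),
        Step("Boole+Brunel", [Module("SoftwareInstallation"),
                              Module("BooleApplication"),
                              Module("BrunelApplication"),
                              Module("BookkeepingUpdate")]),
    ])
    production_job.execute()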

Slide 25: Job class diagram
[Class diagram: a Job contains n Steps, a Step contains n Modules, and each level carries n TypedParameters; Scriptlet, Script and ShellApplication are Module specializations produced through an ApplicationFactory.]

Slide 26: Workflow definition
- A workflow is an object of the same Job class
  - Possibly not fully defined
- The DIRAC Console provides a graphical interface to assemble workflows of any complexity from standard building blocks (modules)
  - Very useful for production workflow definitions
  - May be too powerful (for the moment) for a user job

Slide 27: Production Manager Tools
[Diagram: Production Manager tools (Production Console, command line tools, Transformation Definition tools, Data Manager tools) operate on a Repository holding Productions, ProdRequests and Jobs; Transformation Agents with data filters are notified by the Processing DB through its File Catalogue interface (addFile(), notify()) and submit jobs to DIRAC (submitJob(), getOutput()); a Repository Agent services the repository.]

Slide 28: Job persistency
- Job, Workflow and Production objects can be stored to, and reconstructed from, an XML file or string (round-trip sketched below)
  - Storing objects in the Job Repository
  - Passing Job objects as part of the workflow description
- Job execution
  - Either needs the DIRAC software installed in order to interpret the job description (XML)
  - Or the job can be converted into a pure Python program with no extra dependencies
    - Not used so far
    - Interesting for simple jobs
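
A round-trip through XML with only the standard library; the element names are illustrative, not the actual DIRAC job description schema.

    import xml.etree.ElementTree as ET

    def job_to_xml(job):
        root = ET.Element("Job")
        for step in job["steps"]:
            ET.SubElement(root, "Step", name=step)
        return ET.tostring(root, encoding="unicode")

    def job_from_xml(text):
        root = ET.fromstring(text)
        return {"steps": [step.get("name") for step in root.findall("Step")]}

    xml_text = job_to_xml({"steps": ["Gauss", "Boole+Brunel"]})
    print(job_from_xml(xml_text))   # {'steps': ['Gauss', 'Boole+Brunel']}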

Slide 29: Implementation technologies

Slide 30: Software tools
- Python is the main programming language
  - Fast development cycle
  - Adequate performance of the components: we have not felt the need to rewrite parts in C++ for performance
  - The Production Manager Console is in C++/Qt; its migration to Python/PyQt is being considered
- The CERN CVS repository is used to maintain the code
  - Structured in subdirectories by component family
- Distribution by tar files containing the whole DIRAC code base
  - May also include basic LCG tools (GSI, gridftp client, LFC client) bundled in a Linux-flavor-independent way

Slide 31: Services: XML-RPC protocol
- Standard, simple, available out of the box in the standard Python library (minimal example below)
  - Both server and client
  - Uses the expat XML parser
- Secure server – HGSE transport layer
  - GSI-enabled authentication
  - Authorization rules based on user IDs and roles
  - Supports request rates up to 200 Hz
- Client
  - Problem free, since both client and server are in Python
  - Needs the LCG UI (grid-proxy-init)
  - Can be used with proxy generation by the openssl CLI
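
A self-contained sketch using only the standard library (modern Python 3 module names; in 2005 these were SimpleXMLRPCServer and xmlrpclib, and the secure transport layer shown on the slide is omitted here). The service method is a stub.

    import threading
    from xmlrpc.client import ServerProxy
    from xmlrpc.server import SimpleXMLRPCServer

    def getJobStatus(job_id):
        # Stub standing in for a real job database lookup.
        return {"JobID": job_id, "Status": "Running"}

    server = SimpleXMLRPCServer(("localhost", 9130), logRequests=False)
    server.register_function(getJobStatus)
    threading.Thread(target=server.serve_forever, daemon=True).start()

    client = ServerProxy("http://localhost:9130")
    print(client.getJobStatus(12345))   # {'JobID': 12345, 'Status': 'Running'}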

Slide 32: Services: reliable running
- Uses the runit set of tools
  - Analogous to SysV start-up scripts
  - Runs in user space
  - Provides a service watchdog that restarts services on failure or reboot
  - Rotating, time-stamped logs
  - Tools to monitor and control the running services

Slide 33: Services: underlying database
- The WMS services use a MySQL database for
  - The job database
  - The task queues
  - Job logging
  - Input/output sandboxes (these will migrate to a real SE)
- Regular database backups
  - The WMS can be completely restored from backup on the same or another machine
  - Needs more automation
An illustrative schema sketch follows.
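
The flavor of the underlying tables can be shown with sqlite3 standing in for MySQL; the schema is a guess for illustration, not the real WMS layout.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE Jobs (JobID INTEGER PRIMARY KEY, Status TEXT, Site TEXT);
        CREATE TABLE JobLogging (JobID INTEGER, Status TEXT, StampTime TEXT);
    """)
    db.execute("INSERT INTO Jobs (Status, Site) VALUES ('Waiting', 'LCG.CERN.ch')")
    print(db.execute("SELECT COUNT(*) FROM Jobs WHERE Status = 'Waiting'").fetchone())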

Slide 34: Instant Messaging in DIRAC
- Jabber/XMPP IM
  - Asynchronous, buffered, reliable messaging framework
  - Connection based: authenticate once, then "tunnel" back to the client – a bi-directional connection with only outbound connectivity (no firewall problems, works with NAT)
- Used in DIRAC for
  - Communication between the WMS components
  - Monitoring and remote steering of agents (pending until a secure Jabber connection is available)
  - An interactivity channel with running jobs was demonstrated (also pending until a secure Jabber connection is available)

Slide 35: Scale of the system
- During the RTTC production, a single instance of the DIRAC WMS was managing up to 5,500 concurrent jobs
  - The limit was not reached: the production server was running at ~20% CPU load
  - We could not get access to more resources
- We are confident that ~10K concurrent jobs with ~100K jobs in the queue is within reach now
- The LHC-era numbers will be a factor of 2-3 higher
  - This will require work to increase the capacity of the services
  - Still, a single central service should be enough

Slide 36: Project organization
- Project coordinator plus developers for the individual components
  - No official subgroups or subprojects
- Weekly meetings to follow the progress and to discuss problems
- Frequent unscheduled releases to incorporate new features and fix problems as soon as a solution is available
  - From once per day to once per month
  - Possible because of the simplicity of installation: on the grid, each job brings the latest DIRAC with itself

Slide 37: Project organization (2)
- Documentation
  - Poor; we hope that this review will help to improve it
  - Several notes are available: services installation, Configuration Service, security framework, DIRAC API for analysis jobs, Production Manager docs
  - epydoc-generated code documentation
- DIRAC Savannah page
  - Collection of useful information
  - Bug reporting tool

Slide 38: Data production on the grid
[Diagram: the Production manager feeds the Production DB; the DIRAC Job Management Service submits Pilot Job Agents through the LCG Resource Broker to LCG CEs, and jobs directly to DIRAC CEs on DIRAC sites; on the worker node the agent runs the job, which uploads data to local/remote SEs, registers files in the LFC File Catalog and reports to the DIRAC Job Monitoring Service.]

