
1 EDG - WP1 (Grid Work Scheduling) Status and plans Massimo Sgaravatto INFN Padova

2 WP1 Workload Management System
Ability to submit a job (described via the Condor ClassAd-based Job Description Language, JDL) to the DataGrid testbed from any user machine. The WP1 client allows the user to monitor and control (terminate) the job, and to transfer a "small" amount of data between the client machine and the executing machine.
WP1's Resource Broker (RB) chooses an appropriate computing resource for the job, based on the constraints specified in the JDL: one where the submitting user has proper authorization, that matches the characteristics specified in the job ClassAd (architecture, computing power, application environment, etc.), and where the specified input data (and possibly the chosen output SE) are determined to be "close enough" by the appropriate resource administrators.
Throughout this process, WP1's Logging and Bookkeeping (L&B) services maintain a "state machine" view of each job.

3 dg-job-submit myjob.jdl
myjob.jdl:
Executable = "$(CMS)/exe/sum.exe";
InputData = "LF:testbed0-00019";
ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog, dc=sunlab2g, dc=cnaf, dc=infn, dc=it";
DataAccessProtocol = "gridftp";
InputSandbox = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
OutputSandbox = {"sim.err", "test.out", "sim.log"};
Requirements = other.Architecture == "INTEL" && other.OpSys == "LINUX Red Hat 6.2";
Rank = other.FreeCPUs;

4 WP1 current activities
Support and bug fixes for the EDG 1.2 release.
Addressing RB malfunctions under heavy load: the crisis threshold has been raised from ~300 to ~600 simultaneous jobs.
Other problems (in the underlying Globus services) affecting WP1 software:
Globus GASS cache problems (under stress conditions).
Problems with MDS, and therefore also in the Information Index (II): it gets stuck when a remote GRIS cannot be contacted; a possible patch is under investigation.
Working on the new year-2 functionalities.

5 New functionalities in release 1.2
Automatic proxy renewal: the user credential is renewed in the RB/JSS and on the CE securely, without user interaction, using Globus MyProxy (see the sketch below). This is an interim (working!) solution; a "cleaner" solution will follow once mechanisms to forward the "fresh" proxy to the jobmanager are available in the standard Globus distribution (GRAM 1.6?) and exploited in Condor-G.
Automatic job resubmission: if a job fails because of a "Grid problem" (e.g. Globus GASS cache problems, a failed GridFTP transfer, etc.), the job is rescheduled (possibly on a different CE) and resubmitted.
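How these two features could surface in a user's job description, as a minimal hypothetical sketch: the attribute names MyProxyServer and RetryCount, and the server hostname, are assumptions for illustration and are not confirmed names from release 1.2.
// Hypothetical JDL fragment: MyProxyServer and RetryCount are assumed
// attribute names; the hostname below is invented.
Executable = "$(CMS)/exe/sum.exe";
MyProxyServer = "myproxy.cnaf.infn.it";  // server holding the long-lived credential used for automatic renewal
RetryCount = 3;                          // maximum number of automatic resubmissions after a "Grid problem"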

6 New functionalities in release 1.3 (end of June)
APIs for the applications: C++ as a first step; Java bindings later, if needed by the applications.
Ability to submit MPI jobs: starting by considering MPI jobs within a single CE, using the MPICH implementation with LSF and PBS.
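A minimal sketch of what such an MPI job description might look like in JDL, assuming attributes along the lines of JobType and NodeNumber (illustrative names, not confirmed for release 1.3):
// Hypothetical JDL for an MPICH job confined to a single CE.
// JobType and NodeNumber are assumed attribute names; the executable name is invented.
Executable = "mpi_sim.exe";
JobType = "MPICH";
NodeNumber = 8;                          // number of CPUs requested on one CE
Requirements = other.OpSys == "LINUX Red Hat 6.2";
Rank = other.FreeCPUs;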

7 Year 2 activities
Working on a review of the architecture, to increase reliability and flexibility:
The RB becomes a single-threaded, one-shot service plugged into Condor-G.
The RB only offers matchmaking functionality: it is just responsible for finding the best CEs.
Simplification (e.g. minimizing duplication of persistent information by relying on the Condor-G queue).
Support for new functionalities.
Favour interoperability with other Grid frameworks by allowing WP1 modules (i.e. the RB) to be exploited also "outside" the WP1 WMS; the RB can be called, for example, by Condor-G.
Coordination between EDG WP1 and PPDG to define common guidelines.

8 Other year 2 activities
Support for interactive jobs: jobs running on some CE worker node, where a channel back to the submitting (UI) node is available for the standard streams (PROOF-like applications). Working on a solution based on Condor Bypass.
Support for job dependencies: integration of Condor DAGMan, with "lazy" scheduling: a job (node) is bound to a resource (by the RB) just before that job is submitted. On-going discussions with the Condor team to agree on a common recipe for the ClassAd representation of DAGs (see the sketch below).
Integration of the EDG WP2 Query Optimisation Service: helps the RB find the best CE based on data location, and triggers input data transfer.
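The ClassAd representation of DAGs mentioned above was still being negotiated with the Condor team, so the fragment below is only an illustrative guess at how a two-node DAG with one dependency might be written; every attribute name (Type, Nodes, Dependencies, NodeName, File) is an assumption.
[
  // Hypothetical ClassAd description of a two-node DAG; all attribute
  // names are assumed, since the real representation was still under
  // discussion at the time.
  Type = "dag";
  Nodes = {
    [ NodeName = "simulate";    File = "sim.jdl"  ],
    [ NodeName = "reconstruct"; File = "reco.jdl" ]
  };
  Dependencies = { {"simulate", "reconstruct"} };   // reconstruct runs after simulate
];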

9 Other year 2 activities
Support for "trivial" job checkpointing: the user defines what the state of a job is (as <attribute, value> pairs) and can save that state; the computation can then be restarted from a previously saved state (see the sketch below).
Support for job partitioning, using the job checkpointing and DAGMan mechanisms: the original job is partitioned into sub-jobs which can be executed in parallel. At the end, each sub-job must save a final state, which is then retrieved by a job aggregator responsible for collecting the results of the sub-jobs and producing the overall output.
Integration of advance reservation/co-allocation, with a Globus GARA-based approach.
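To make the <attribute, value> idea concrete, a saved job state could be pictured as a small ClassAd like the one below; the attribute names and values are invented purely for illustration and are not part of any published WP1 interface.
[
  // Hypothetical checkpoint state: a flat list of <attribute, value>
  // pairs chosen by the user. All names and values are invented.
  LastProcessedEvent = 15000;          // where to resume the computation from
  RandomSeed = 42;
  PartialOutputFile = "partial_sim.out";
];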

10 Other year 2 activities
Grid accounting, based upon a computational economy model: users pay in order to execute their jobs on the resources, and the resource owners earn credits by executing the user jobs. The aims are to reach a nearly stable equilibrium able to satisfy the needs of both resource "producers" and "consumers", and to credit job resource usage to the resource owner(s) after execution.
GUI: a Python-based GUI is already implemented; Java components are in the works.
Possible use of the EDG WP3 R-GMA for the L&B services: tests and integration are on-going, and discussions with the WP3 team are underway to obtain some missing pieces.
Improving error reporting.

11 Other year 2 activities
Matchmaking in the RB that also considers SE characteristics (besides the CE): gangmatching, i.e. matchmaking between multiple (more than two) entities. On-going discussions with the Condor team to decide how to proceed (see the sketch below).
Use of the new GLUE IS schema. Goal: the same IS schema across the US and EU HENP Grid projects.
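A loose illustration of the kind of requirement a combined CE+SE match could express; the exact gangmatching syntax was still under discussion, so the attribute names CloseSE and FreeSpace (and the units) are assumptions.
// Hypothetical JDL requirement constraining both the CE and a "close" SE.
// CloseSE and FreeSpace are assumed attribute names; size in MB is assumed.
Requirements = other.Architecture == "INTEL"
            && other.CloseSE.FreeSpace >= 5000;   // at least ~5 GB free on the nearby SE
Rank = other.FreeCPUs;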

12 Other info http://www.infn.it/workload-grid

