DataGrid is a project funded by the European Union CHEP 2003 – 24-28 March 2003 – M. Sgaravatto – n° 1 The EU DataGrid Workload Management System: towards.

Slides:



Advertisements
Similar presentations
WP1 Grid Workload Management Massimo Sgaravatto INFN Padova
Advertisements

Grid Workload Management (WP 1) Report to INFN-GRID TB Massimo Sgaravatto INFN Padova.
Workload Management David Colling Imperial College London.
EGC 2005, CrossGrid technical achievements, Amsterdam, Feb. 16th, 2005 WP2-3 New Generation Environment for Grid Interactive MPI Applications M igrating.
EU 2nd Year Review – Jan – Title – n° 1 WP1 Speaker name (Speaker function and WP ) Presentation address e.g.
Workload management Owen Maroney, Imperial College London (with a little help from David Colling)
INFSO-RI Enabling Grids for E-sciencE Workload Management System and Job Description Language.
FP7-INFRA Enabling Grids for E-sciencE EGEE Induction Grid training for users, Institute of Physics Belgrade, Serbia Sep. 19, 2008.
The Grid Constantinos Kourouyiannis Ξ Architecture Group.
Job Submission The European DataGrid Project Team
David Colling Imperial College London Running your jobs everywhere.
WP 1 Grid Workload Management Massimo Sgaravatto INFN Padova.
INFSO-RI Enabling Grids for E-sciencE EGEE Middleware The Resource Broker EGEE project members.
The DataGrid Project NIKHEF, Wetenschappelijke Jaarvergadering, 19 December 2002
E. Ronchieri – n° 1 EDG release 2 Elisabetta Ronchieri INFN CNAF - DataGrid WP1 – Workload Management System
Workload Management Workpackage Massimo Sgaravatto INFN Padova.
The EDG Workload Management System – n° 1 The EDG Workload Management System.
Grid Infrastructure.
INFSO-RI Enabling Grids for E-sciencE Grid Infrastructure & Related Projects Eddie Aronovich Tel-Aviv University, School of CS
DataGrid Kimmo Soikkeli Ilkka Sormunen. What is DataGrid? DataGrid is a project that aims to enable access to geographically distributed computing power.
GRID Workload Management System Massimo Sgaravatto INFN Padova.
Workload Management Massimo Sgaravatto INFN Padova.
EDG - WP1 (Grid Work Scheduling) Status and plans Massimo Sgaravatto - INFN Padova Francesco Prelz – INFN Milano.
“Grey areas” of the new architecture Massimo Sgaravatto INFN Padova.
Workload Management WP Status and next steps Massimo Sgaravatto INFN Padova.
INFSO-RI Enabling Grids for E-sciencE Logging and Bookkeeping and Job Provenance Services Ludek Matyska (CESNET) on behalf of the.
Computational grids and grids projects DSS,
Grid Workload Management & Condor Massimo Sgaravatto INFN Padova.
Enabling Grids for E-sciencE Workload Management System on gLite middleware Matthieu Reichstadt CNRS/IN2P3 ACGRID School, Hanoi (Vietnam)
M. Sgaravatto – n° 1 The EDG Workload Management System: release 2 Massimo Sgaravatto INFN Padova - DataGrid WP1
DataGrid WP1 Massimo Sgaravatto INFN Padova. WP1 (Grid Workload Management) Objective of the first DataGrid workpackage is (according to the project "Technical.
INFSO-RI Enabling Grids for E-sciencE Workload Management System Mike Mineter
DataGrid is a project funded by the European Commission under contract IST rd EU Review – 19-20/02/2004 WP1 activity, achievements and plans.
Grid Workload Management Massimo Sgaravatto INFN Padova.
- Distributed Analysis (07may02 - USA Grid SW BNL) Distributed Processing Craig E. Tull HCG/NERSC/LBNL (US) ATLAS Grid Software.
Job Submission The European DataGrid Project Team
Job Submission and Resource Brokering WP 1. Contents: The components What (should) works now and configuration How to submit jobs … the UI and JDL The.
Grid checkpointing in the European DataGrid Project Alessio Gianelle – INFN Padova Rosario Peluso – INFN Padova Francesco Prelz – INFN Milano Massimo Sgaravatto.
M. Sgaravatto – n° 1 Overview of WP1 Workload Management System in EDG 2.x Massimo Sgaravatto INFN Padova - DataGrid WP1
EGEE is a project funded by the European Union under contract IST Job Description Language - more control over your Job Assaf Gottlieb University.
EGEE is a project funded by the European Union under contract IST EGEE Tutorial Turin, January Job Services Emidio.
M. Sgaravatto – n° 1 Overview of release 2 of the EDG WP1 Workload Management System deployed in the INFN production Grid Massimo Sgaravatto INFN Padova.
WP1 WMS rel. 2.0 Some issues Massimo Sgaravatto INFN Padova.
EGEE is a project funded by the European Union under contract INFSO-RI Practical approaches to Grid workload management in the EGEE project Massimo.
High-Performance Computing Lab Overview: Job Submission in EDG & Globus November 2002 Wei Xing.
EGEE is a project funded by the European Union under contract IST WS-Based Advance Reservation and Co-allocation Architecture Proposal T.Ferrari,
Workload Management System Jason Shih WLCG T2 Asia Workshop Dec 2, 2006: TIFR.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Grid2Win : gLite for Microsoft Windows Roberto.
Summary from WP 1 Parallel Section Massimo Sgaravatto INFN Padova.
EGEE is a project funded by the European Union under contract IST Job Description Language – How to control your Job Nadav Grossaug IsraGrid.
EDG - WP1 (Grid Work Scheduling) Status and plans Massimo Sgaravatto INFN Padova.
Grid Workload Management (WP 1) Massimo Sgaravatto INFN Padova.
WP1 WMS release 2: status and open issues Massimo Sgaravatto INFN Padova.
The DataGrid Project NIKHEF, Wetenschappelijke Jaarvergadering, 19 December 2002
EGEE 3 rd conference - Athens – 20/04/2005 CREAM JDL vs JSDL Massimo Sgaravatto INFN - Padova.
WP1 Status and plans Francesco Prelz, Massimo Sgaravatto 4 th EDG Project Conference Paris, March 6 th, 2002.
Job Submission The European DataGrid Project Team
User Interface UI TP: UI User Interface installation & configuration.
Introduction to Computing Element HsiKai Wang Academia Sinica Grid Computing Center, Taiwan.
The EPIKH Project (Exchange Programme to advance e-Infrastructure Know-How) gLite Grid Introduction Salma Saber Electronic.
Enabling Grids for E-sciencE Work Load Management & Simple Job Submission Practical Shu-Ting Liao APROC, ASGC EGEE Tutorial.
EU 2nd Year Review – Feb – WP1 Demo – n° 1 WP1 demo Grid “logical” checkpointing Fabrizio Pacini (Datamat SpA, WP1 )
Workload Management Workpackage
The EDG Testbed Deployment Details
Workload Management System ( WMS )
EGEE tutorial, Job Description Language - more control over your Job Assaf Gottlieb Tel-Aviv University EGEE is a project.
Job Submission in the DataGrid Workload Management System
Introduction to Grid Technology
5. Job Submission Grid Computing.
The EU DataGrid Job Submission Services
Presentation transcript:

DataGrid is a project funded by the European Union CHEP 2003 – March 2003 – M. Sgaravatto – n° 1 The EU DataGrid Workload Management System: towards the second major release Massimo Sgaravatto INFN Padova - DataGrid WP1

CHEP 2003 – March 2003 – M. Sgaravatto – n° 2 Talk Outline u DataGrid and Datagrid WP1 u The DataGrid Workload Management System u The new revised WP1 WMS architecture n What’s has been revised and new functionalities u Future work & conclusions Authors G. Avellino, S. Beco, B. Cantalupo, F. Pacini, A. Terracina, A.Maraschini (DATAMAT S.p.A) D. Colling (Imperial College) S. Monforte, M. Pappalardo (INFN, Sezione di Catania) L. Salconi (INFN, Sezione di Pisa) F. Giacomini, E. Ronchieri (INFN, CNAF) D. Kouril, A. Krenek, L. Matyska, M. Mulac, J. Pospisil, M. Ruda, Z. Salvet, J. Sitera, M. Vocu (CESNET) M. Mezzadri, F. Prelz (INFN, Sezione di Milano) Gianelle, R. Peluso, M. Sgaravatto (INFN, Sezione di Padova) S. Barale, A. Guarise, A. Werbrouck (INFN, Sezione di Torino)

CHEP 2003 – March 2003 – M. Sgaravatto – n° 3 DataGrid and DataGrid WP1 u European DataGrid Project n Goal: Grid software projects meet real-life scientific applications (High Energy Physics, Earth Observation, Biology) and their deadlines, with mutual benefit n Middleware development and integration of existing middleware n Bring the issues of data identification, location, transfer and access into the picture n Large scale testbed u WP1 (Grid Workload Management) n Mandate: “To define and implement a suitable architecture for distributed scheduling and resource management on a GRID environment“ n This includes the following areas of activity: s Design and development of a useful (as seen from the DataGrid applications perspective) grid scheduler, or Resource Broker s Design and development of a suitable job description and monitoring infrastructure s Design and implementation of a suitable job accounting structure

CHEP 2003 – March 2003 – M. Sgaravatto – n° 4 WP1 Workload Management System u Working Workload Management System prototype implemented by WP1 in the first phase of the EDG project (presented at CHEP2001) n Ability to define (via a Job Description Language, JDL) a job, submit it to the DataGrid testbed from any user machine, and control it n WP1's Resource Broker chooses an appropriate computing resource for the job, based on the constraints specified in the JDL s Where the submitting user has proper authorization s That matches the characteristics specified in the job JDL (Architecture, computing power, application environment, etc.) s Where the specified input data (and possibly the chosen output Storage Element) are determined to be "close enough" by the appropriate resource administrators u Application users have now been experiencing for about one year and a half with this first release of the workload management system n Stress tests and semi-production activities (e.g. CMS stress tests, Atlas efforts) n Significant achievements exploited by the experiments but also various problems were spotted s Impacting in particular the reliability and scalability of the system

CHEP 2003 – March 2003 – M. Sgaravatto – n° 5 Review of WP1 architecture u WP1 WMS architecture reviewed n To apply the “lessons” learned and addressing the shortcomings emerged with the first release of the software s Reduce of persistent job info repositories s Avoid long-lived processes s Delegate some functionalities to pluggable modules s Make more reliable (e.g.via file system) communication among components s … (see also other EDG WP1 talk given by Francesco Prelz) n To increase the reliability of the system n To favor interoperability with other Grid frameworks, by allowing exploiting WP1 modules (e.g. RB) also “outside” the EDG WMS n To support new functionalities n New WMS (v. 2.0) presented at the 2 nd EDG review and scheduled for integration at April 2003

CHEP 2003 – March 2003 – M. Sgaravatto – n° 6 WP1 reviewed architecture Details in EDG deliverable D1.4 …

CHEP 2003 – March 2003 – M. Sgaravatto – n° 7 Job submission example UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status

Job submission UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status edg-job-submit myjob.jdl Myjob.jdl JobType = “Normal”; Executable = "$(CMS)/exe/sum.exe"; InputData = "LF:testbed "; ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog,dc=sunlab2g, dc=cnaf, dc=infn, dc=it"; DataAccessProtocol = "gridftp"; InputSandbox = {"/home/user/WP1testC","/home/file*”, "/home/user/DATA/*"}; OutputSandbox = {“sim.err”, “test.out”, “sim.log"}; Requirements = other. GlueHostOperatingSystemName == “linux" && other. GlueHostOperatingSystemRelease == "Red Hat 6.2“ && other.GlueCEPolicyMaxWallClockTime > 10000; Rank = other.GlueCEStateFreeCPUs; submitted Job Status UI: allows users to access the functionalities of the WMS Job Description Language (JDL) to specify job characteristics and requirements

Job submission UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status RB storage Input Sandbox files Job waiting submitted Job Status NS: network daemon responsible for accepting incoming requests

Job submission UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status RB storage waiting submitted Job Status WM: responsible to take the appropriate actions to satisfy the request Job

Job submission UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status RB storage waiting submitted Job Status Match- Maker/ Broker Where must this job be executed ?

Job submission UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status RB storage waiting submitted Job Status Match- Maker/ Broker Matchmaker: responsible to find the “best” CE where to submit a job

Job submission UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status RB storage waiting submitted Job Status Match- Maker/ Broker Where are (which SEs) the needed data ? What is the status of the Grid ?

Job submission UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status RB storage waiting submitted Job Status Match- Maker/ Broker CE choice

Job submission UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status RB storage waiting submitted Job Status Job Adapter JA: responsible for the final “touches” to the job before performing submission (e.g. creation of wrapper script, etc.)

Job submission UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status RB storage Job Status JC: responsible for the actual job management operations (done via CondorG) Job submitted waiting ready

Job submission UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status RB storage Job Status Job Input Sandbox files submitted waiting ready scheduled

Job submission UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node RB storage Job Status Input Sandbox submitted waiting ready scheduled running “Grid enabled” data transfers/ accesses Job

Job submission UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node RB storage Job Status Output Sandbox files submitted waiting ready scheduled running done

Job submission UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node RB storage Job Status Output Sandbox submitted waiting ready scheduled running done edg-job-get-output

Job submission UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node RB storage Job Status Output Sandbox files submitted waiting ready scheduled running done cleared

CHEP 2003 – March 2003 – M. Sgaravatto – n° 22 Job monitoring UI Log Monitor Logging & Bookkeeping Network Server Job Contr. - CondorG Workload Manager Computing Element RB node LM: parses CondorG log file (where CondorG logs info about jobs) and notifies LB LB: receives and stores job events; processes corresponding job status Log of job events edg-job-status Job status

CHEP 2003 – March 2003 – M. Sgaravatto – n° 23 New functionalities introduced u User APIs n Including a Java GUI u “Trivial” job checkpointing service n User can save from time to time the state of the job (defined by the application) n A job can be restarted from an intermediate (i.e. “previously” saved) job state u Glue schema compliance u Gangmatching n Allow to take into account both CE and SE information in the matchmaking s For example to require a job to run on a CE close to a SE with “enough space” u Integration of EDG WP2 Query Optimisation Service n Help for RB to find the best CE based on data location u Support for parallel MPI jobs u Support for interactive jobs n Jobs running on some CE worker node where a channel to the submitting (UI) node is available for the standard streams (by integrating the Condor Bypass software)

CHEP 2003 – March 2003 – M. Sgaravatto – n° 24 Interactive Jobs Job Shadow Network Server Workload Manager Job Controller/CondorG LB Submission machine InputSandbox Computing Element WN LRMS Gatekeeper WN RSL Shadow port shadow host shadow port shadown host OutputSandbox Job Output Sandbox Input Sandbox JDL StdIn StdOut StdErr Files transfer New flows Usual Job Submission flows Pillow Process Console Agent UI edg-job-submit jobint.jdl jobint.jdl [JobType = “”interactive”; ListenerPort = 2654; Executable = “int-prg.exe"; StdOutput = Outfile; InputSandbox = "/home/user/int-prg.exe”, OutputSandbox = “Outfile”, Requirements = other. GlueHostOperatingSystemName == “linux" && Other.GlueHostOperatingSystemRelease == “RH 6.2“;]

CHEP 2003 – March 2003 – M. Sgaravatto – n° 25 Future functionalities u Dependencies of jobs n Integration of Condor DAGMan n “Lazy” scheduling: job (node) bound to a resource (by RB) just before that job can be submitted (i.e. when it is free of dependencies) u Support for job partitioning n Use of job checkpointing and DAGMan mechanisms s Original job partitioned in sub-jobs which can be executed in parallel s At the end each sub-job must save a final state, then retrieved by a job aggregator, responsible to collect the results of the sub-jobs and produce the overall output u Grid Accounting n Based upon a computational economy model s Users pay in order to execute their jobs on the resources and the owner of the resources earn credits by executing the user jobs n To have a nearly stable equilibrium able to satisfy the needs of both resource `producers' and `consumers' n To credit of job resource usage to the resource owner(s) after execution u Advance reservation and co-allocation n Globus GARA based approach u Development already started (most of this software already in a good shape) but integration foreseen after release 2.0

CHEP 2003 – March 2003 – M. Sgaravatto – n° 26 Conclusions u In the first phase of the EDG project, WP1 implemented a working Workload Management System prototype u Applications have been experiencing with this WMS for one year and a half u Revised WMS architecture (WMS v. 2.0 planned for integration in Apr 2003) n To address emerged shortcomings, e.g. s Reduce of persistent job info repositories s Avoid long-lived processes s Delegate some functionalities to pluggable modules s Make more reliable communication among components n To support new functionalities s APIs, Interactive jobs, Job checkpointing, Gangmatching, … n Hooks to support other functionalities planned to be integrated later s DAGman, Job partitioning, Grid accounting, Resource reservation and co-allocation u Other info n (Home page for EDG WP1) n (Home page for EDG project) Thanks to the EU and our national funding agencies for their support of this work