Workload Management System. Jason Shih, WLCG T2 Asia Workshop, Dec 2, 2006: TIFR.



Outline
- WMS introduction
- Job Submission Sequence and WMS Components
- User Job submit

Why do we need a Workload Management System?
- For the Grid environment: distributed scheduling and resource management are needed.
- For a user:
  - to submit jobs,
  - to execute them on the “best resources”,
  - to get information about their status,
  - to retrieve their output.

WMS Architecture [Diagram components: UI, RB, CE/WN]

Section: Job Submission Sequence and WMS Components

Job Submission Flow [Diagram components: UI, RB, File catalog, IS, SE, CE & WN. Labels: JDL, Input Sandbox, Output Sandbox.]

[Diagram components: UI; RB node (Network Server, Workload Manager, Job Controller - CondorG); Replica Location Server; Information Service (CE and SE characteristics & status); Storage Element; Computing Element. Job status: submitted.]
Job Description Language (.jdl): specifies job characteristics and requirements.
Command issued on the UI:
  edg-job-submit --vo dteam Helloworld.jdl
Contents of Helloworld.jdl:
  Executable = "/bin/echo";
  Arguments = "Hello World.....o^.^o";
  StdOutput = "message.txt";
  StdError = "stderror";
  OutputSandbox = {"message.txt","stderror"};
  Requirements = other.GlueCEUniqueID == "lcg00125.grid.sinica.edu.tw:2119/jobmanager-lcgpbs-dteam";

User Interface (UI)
The user’s interface to the Grid. The basic functionalities are:
- list the computing resources,
- submit a job,
- get the job status,
- cancel a job,
- retrieve the output of a job.
[Diagram labels: UI, JDL]

UI: allows users to access the functionalities of the WMS (via command line, GUI, C++ and Java APIs). [Diagram labels: Input Sandbox files. Job status: submitted.]

Resource Broker (RB)
- Runs the Workload Management System.
- Accepts job submissions.
- Provides a matchmaking service: dispatches jobs to an appropriate Computing Element (CE).
- Allows users to get information about the status of their jobs and to retrieve their output.
- A configuration file on each UI node determines which RB node(s) will be used.

Resource Broker (NS & WM)
Network Server (NS)
- Accepts incoming requests from the UI.
- Authenticates the user.
- Obtains a delegated full proxy from the user proxy.
- Enqueues the job to the Workload Manager.
Workload Manager (WM)
- Calls the Matchmaker to find the resource which best matches the job requirements.
- Interacts with the Information System and the File catalog.
- Calculates the ranking of all the matched resources.
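The matchmaking step described above (filter the resources that satisfy the job requirements, then rank what is left) can be sketched roughly as follows. This is only an illustration of the idea, not the Matchmaker's code: the resource dictionaries, the attribute names (`id`, `free_cpus`, `status`) and the "higher rank is better" convention are assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional

Resource = Dict[str, object]  # hypothetical: attributes published for one CE


def matchmake(
    resources: List[Resource],
    requirements: Callable[[Resource], bool],
    rank: Callable[[Resource], float],
) -> List[Resource]:
    """Keep the resources that satisfy the requirements, best rank first."""
    matched = [r for r in resources if requirements(r)]
    # Assumption of this sketch: a higher rank means a more preferred resource.
    return sorted(matched, key=rank, reverse=True)


def best_match(
    resources: List[Resource],
    requirements: Callable[[Resource], bool],
    rank: Callable[[Resource], float],
) -> Optional[Resource]:
    """Return the top-ranked matching resource, or None if nothing matches."""
    ranked = matchmake(resources, requirements, rank)
    return ranked[0] if ranked else None


# Illustrative data only: these attribute names are made up for the sketch.
ces = [
    {"id": "ce-a", "free_cpus": 4, "status": "Production"},
    {"id": "ce-b", "free_cpus": 10, "status": "Production"},
    {"id": "ce-c", "free_cpus": 50, "status": "Closed"},
]
chosen = best_match(
    ces,
    requirements=lambda ce: ce["status"] == "Production",
    rank=lambda ce: ce["free_cpus"],
)
```

Here `ce-c` has the most free CPUs but is dropped by the requirements filter before ranking, which is the point of doing the two steps in that order.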

Resource Broker (JC & CondorG)
Job Controller (JC)
- Converts the job ClassAd into a Condor submit file and hands the job over to CondorG.
Condor-G
- Condor-G is a Globus-enabled version of the Condor scheduler.
- CondorG consists of two elements:
  - The condor_gridmanager process: interprets the ClassAd description and translates it into RSL; submits the job to the CE; and submits an extra job (the grid monitor) per CE and per user to monitor the user jobs.
  - The GAHP server: it is a GRAM client used to contact the edg-gatekeeper, and a GASS server for the results from the grid monitor job.

Resource Broker (LM & LB)
Log Monitor (LM)
- Continuously parses the Condor-G logs.
- Looks for events concerning active jobs.
Logging and Bookkeeping (LB)
- All this information is stored by the Logging and Bookkeeping service.
- Collection is done by the LB local-loggers.

NS: responsible for accepting incoming requests. [Diagram labels: Job, Input Sandbox files, RB storage. Job status: submitted, waiting.]

WM: acts to satisfy the request. [Diagram labels: Job, WM. Job status: submitted, waiting.]

Matchmaker: where must this job be executed? [Job status: submitted, waiting.]

Matchmaker: responsible for finding the “best” CE for a job. [Job status: submitted, waiting.]

Matchmaker: where (on which SEs) are the needed data? What is the status of the Grid? [Diagram components consulted: Replica Location Server, Information Service. Job status: submitted, waiting.]

Matchmaker: CE choice. [Job status: submitted, waiting.]

Job Controller: responsible for the actual job management operations (done via CondorG). [Diagram labels: Job. Job status: submitted, waiting, ready.]

[Diagram labels: Job, Computing Element. Job status: submitted, waiting, ready, scheduled.]

Computing Element (CE)
- The CE is the Grid interface to a set of computing nodes.
- The admitted format for the CEId is: <hostname>:<port>/jobmanager-<batch system>-<queue>, e.g. lcg00125.grid.sinica.edu.tw:2119/jobmanager-lcgpbs-dteam
- A Computing Element is built on a homogeneous farm of computing nodes (called Worker Nodes).
- Each LCG-2 site runs at least one CE and a farm of WNs behind it.
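As a small illustration of the CEId layout, the example identifier on this slide can be split into its parts with a regular expression. This is a sketch under assumptions: the field names (`host`, `port`, `jobmanager`, `queue`) are chosen for the example, and the pattern assumes the part right after `jobmanager-` contains no hyphen.

```python
import re
from typing import NamedTuple


class CEId(NamedTuple):
    host: str
    port: int
    jobmanager: str  # hypothetical field name: the part right after "jobmanager-"
    queue: str       # hypothetical field name: the trailing part


# host:port/jobmanager-<something without hyphens>-<rest>
_CE_ID = re.compile(
    r"^(?P<host>[^:/\s]+):(?P<port>\d+)/jobmanager-(?P<jm>[^-\s]+)-(?P<queue>\S+)$"
)


def parse_ce_id(text: str) -> CEId:
    """Split a CEId string into host, port, jobmanager type and queue."""
    m = _CE_ID.match(text.strip())
    if m is None:
        raise ValueError(f"not a recognised CEId: {text!r}")
    return CEId(m["host"], int(m["port"]), m["jm"], m["queue"])


example = parse_ce_id("lcg00125.grid.sinica.edu.tw:2119/jobmanager-lcgpbs-dteam")
```

For the slide's example this yields host `lcg00125.grid.sinica.edu.tw`, port `2119`, jobmanager `lcgpbs` and queue `dteam`.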

Computing Element (Gatekeeper & Globus-jobmanager)
Gatekeeper
- Grants access to the CE.
- Authentication and authorization are more complicated than on the RB.
- The gatekeeper accepts requests from Condor-G and forks the globus-jobmanager.
Globus-jobmanager
- Offers an interface to the local batch system.
- Submits or cancels a job.

Computing Element (Batch System)
The batch system handles the job execution on the available worker nodes of the local farm. Here the batch system consists of:
- the torque resource manager (derived from OpenPBS),
- the maui job scheduler.

Worker Node (WN)
- It is the host executing the job.
- A set of WNs managed by a CE constitutes a computing cluster.
- A cluster MUST be homogeneous.
- It is probably the simplest part of the Grid.
- The WN runs the job wrapper.

“Grid enabled” data transfers/accesses. [Diagram labels: Job, Input Sandbox files, Computing Element, Storage Element. Job status: submitted, waiting, ready, scheduled, running.]

[Diagram labels: Output Sandbox files, RB storage. Job status: submitted, waiting, ready, scheduled, running, done.]

edg-job-get-output [Job status: submitted, waiting, ready, scheduled, running, done.]

[Diagram labels: Output Sandbox files, UI. Job status: submitted, waiting, ready, scheduled, running, done, cleared.]

LM: parses the CondorG log file (where CondorG logs info about jobs) and notifies the LB.
LB: receives and stores job events; processes the corresponding job status.
[Diagram components: UI, Log Monitor, Logging & Bookkeeping, Network Server, Workload Manager, Job Controller - CondorG, Computing Element, RB node. Commands shown: edg-job-status, edg-job-get-logging-info.]

Possible job states
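The states that appear in the preceding diagrams (submitted, waiting, ready, scheduled, running, done, cleared) can be written down as a simple ordered progression. This is a minimal sketch of the successful path only; failure states are not modelled here, and the helper name `next_state` is made up for the example.

```python
from enum import Enum
from typing import Optional


class JobState(Enum):
    SUBMITTED = "submitted"
    WAITING = "waiting"
    READY = "ready"
    SCHEDULED = "scheduled"
    RUNNING = "running"
    DONE = "done"
    CLEARED = "cleared"


# Order in which the states show up in the job-flow diagrams.
SUCCESS_PATH = [
    JobState.SUBMITTED,
    JobState.WAITING,
    JobState.READY,
    JobState.SCHEDULED,
    JobState.RUNNING,
    JobState.DONE,
    JobState.CLEARED,
]


def next_state(state: JobState) -> Optional[JobState]:
    """Return the state that follows on the successful path, or None at the end."""
    index = SUCCESS_PATH.index(state)
    if index + 1 < len(SUCCESS_PATH):
        return SUCCESS_PATH[index + 1]
    return None
```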

Job resubmission
- If something goes wrong, the WMS tries to reschedule and resubmit the job.
- Maximum number of resubmissions:
  - RetryCount: JDL attribute
  - MaxRetryCount: attribute in the “RB” configuration file
- E.g. to disable job resubmission for a particular job, set RetryCount=0; in the JDL file.
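How the two limits might interact can be sketched as below. The assumption here, not stated on the slide, is that the RB-side MaxRetryCount caps the per-job RetryCount, so the smaller of the two applies. The function names are hypothetical and `attempt` stands in for one submission attempt.

```python
from typing import Callable, Tuple


def effective_retry_limit(retry_count: int, max_retry_count: int) -> int:
    """Assumed rule: the RB-wide limit caps the value requested in the JDL."""
    return min(retry_count, max_retry_count)


def run_with_resubmission(attempt: Callable[[], bool], limit: int) -> Tuple[bool, int]:
    """Run attempt() once, then resubmit up to `limit` times while it fails.

    Returns (succeeded, number_of_attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        if attempt():
            return True, attempts
        if attempts > limit:
            # The first attempt plus `limit` resubmissions have all failed.
            return False, attempts


# With RetryCount=0 a failing job is attempted exactly once.
result_no_retry = run_with_resubmission(lambda: False, effective_retry_limit(0, 3))
```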

Section: User Job submit

Job Preparation
Some issues:
- What are the characteristics of the job?
- What are the computational requirements?
- What are the data requirements of the job?
- Are there any software dependencies?

Job Description Language (JDL)
- A Job Description Language (JDL) is used to describe a job.
- It is based upon Condor’s CLASSified ADvertisement language (ClassAd).
- ClassAd syntax: <attribute> = <value>;

How to write a Job Description
Here is a minimal job description:
  Executable = "/bin/echo";
  Arguments = "Hello World!";
  StdError = "stderr";
  StdOutput = "stdout";
  OutputSandbox = {"stderr", "stdout"};
We specified:
- the program to run and its arguments (the executable is already on any computing node),
- that the standard error and output streams are directed to files,
- what to do with the output.
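A job description like the minimal one on this slide can also be generated from a script. The following is a small sketch, not part of the WMS tools: `jdl_value` and `to_jdl` are hypothetical helpers that only handle strings, integers, booleans and lists, and they do not cover unquoted expressions such as Requirements or Rank.

```python
from typing import Mapping


def jdl_value(value: object) -> str:
    """Render a Python value as a JDL literal (strings, ints, bools, lists only)."""
    if isinstance(value, (list, tuple)):
        return "{" + ", ".join(jdl_value(v) for v in value) + "}"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    # Everything else is written as a double-quoted string.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_jdl(attributes: Mapping[str, object]) -> str:
    """Render attributes as 'Name = value;' lines, in insertion order."""
    return "\n".join(f"{name} = {jdl_value(v)};" for name, v in attributes.items())


hello = to_jdl(
    {
        "Executable": "/bin/echo",
        "Arguments": "Hello World!",
        "StdError": "stderr",
        "StdOutput": "stdout",
        "OutputSandbox": ["stderr", "stdout"],
    }
)
```

Writing `hello` to a file such as HelloWorld.jdl gives the same five attributes as the slide.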

JDL: relevant attributes
- Executable (mandatory): the command name.
- Arguments (optional): job command line arguments.
- StdInput, StdOutput, StdError (optional): standard input/output/error of the job.
- Environment: list of environment settings needed by the job to run properly.
- InputSandbox (optional): list of files on the UI local disk needed by the job for running. The listed files are automatically staged to the remote resource.
- OutputSandbox (optional): list of files, generated by the job, which have to be retrieved.

JDL: relevant attributes
- Requirements: job requirements on computing resources.
  - Specified using attributes of resources published in the Information Service; all the GLUE attributes of the IS can be used.
  - Its value is a Boolean expression.
  - If not specified, the default value defined in the UI configuration file is considered.
- Rank: expresses preference (how to rank resources that have already met the Requirements expression).
  - Specified using attributes of resources published in the Information Service.
  - If not specified, the default value defined in the UI configuration file is considered.
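The "default from the UI configuration file" behaviour for Requirements and Rank amounts to filling in whatever the job description leaves out. A rough sketch follows: `apply_ui_defaults` is a hypothetical helper, the default expressions shown are placeholders rather than values read from any real configuration file, and attribute names are compared case-insensitively as in ClassAd.

```python
from typing import Dict


def apply_ui_defaults(job: Dict[str, str], ui_defaults: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `job` with missing attributes taken from `ui_defaults`."""
    present = {name.lower() for name in job}
    merged = dict(job)
    for name, value in ui_defaults.items():
        if name.lower() not in present:
            merged[name] = value
    return merged


# Placeholder defaults for illustration; a real UI configuration may differ.
ui_defaults = {
    "Requirements": 'other.GlueCEStateStatus == "Production"',
    "Rank": "-other.GlueCEStateEstimatedResponseTime",
}

# The job sets its own Rank, so only Requirements is filled in from the defaults.
job = {"Executable": '"/bin/echo"', "rank": "other.GlueCEStateFreeCPUs"}
merged = apply_ui_defaults(job, ui_defaults)
```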

JDL: relevant attributes
- InputData: refers to data used as input by the job. These data are published in the Replica Location Service (RLS) and stored in the SEs; they are identified by LFNs and/or GUIDs.
- DataAccessProtocol: the protocol or the list of protocols which the application is able to use for accessing InputData on a given SE.
- OutputSE: the RB uses it to choose a CE that is compatible with the job and is close to the SE.

JDL: important notes
- Input and output sandboxes are intended for relatively small files (a few megabytes).
- Jobs that need large input files or generate large output files should instead read from or write to an SE.
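A simple pre-submission check can flag files that look too big for a sandbox. This is a sketch: the slide only says "few megabytes", so the 10 MB threshold below is an arbitrary assumption, and both function names are hypothetical.

```python
import os
from typing import Dict, Iterable, List

LIMIT_BYTES = 10 * 1024 * 1024  # hypothetical threshold, not taken from the slides


def file_sizes(paths: Iterable[str]) -> Dict[str, int]:
    """Map each local file path to its size in bytes."""
    return {path: os.path.getsize(path) for path in paths}


def too_large_for_sandbox(sizes: Dict[str, int], limit: int = LIMIT_BYTES) -> List[str]:
    """Return the names of files whose size exceeds the limit, sorted by name."""
    return sorted(name for name, size in sizes.items() if size > limit)


# Example with made-up sizes: only the 200 MB file is flagged.
flagged = too_large_for_sandbox({"job.sh": 2_000, "events.root": 200 * 1024 * 1024})
```

Files flagged this way are candidates for being stored on an SE rather than listed in InputSandbox or OutputSandbox.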

Other UI commands
- edg-job-list-match: lists resources matching a job description; performs the matchmaking without submitting the job.
- edg-job-cancel: cancels a given job.
- edg-job-status: displays the status of the job.
- edg-job-get-output: returns the job output (the OutputSandbox files) to the user.
- edg-job-get-logging-info: displays logging information about submitted jobs; very useful for debugging purposes.

Job submission
  $ grid-proxy-init
  Your identity:/C=TW/O=AS/OU=CC/CN=Horng-Liang
  Enter GRID pass phrase for this identity:
  Creating proxy Done
  Your proxy is valid until: Sun Mar 12 16:03:
  $ edg-job-submit -o id.txt -vo dteam HelloWorld.jdl
  The job has been successfully submitted to the Network Server.
  Use edg-job-status command to check job current status.
  Your job identifier (edg_jobId) is:
  -
  The edg_jobId has been saved in the following file:
  /home/hlshih/JSexercise1/id.txt
  =====================================================================

Checking the status
  $ edg-job-status -i id.txt
  OR
  $ edg-job-status DDd2KA DDd2KA
  *************************************************************
  BOOKKEEPING INFORMATION:
  Status info for the Job :
  Current Status: Done (Success)
  Exit code: 0
  Status Reason: Job terminated successfully
  Destination: lcg00125.grid.sinica.edu.tw:2119/jobmanager-lcgpbs-dteam
  reached on: Sun Mar 12 04:30:
  *************************************************************
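When the status has to be checked from a script, the fields shown in this output can be picked out with a small parser. This is a sketch under assumptions: it expects one "Name: value" field per line (the line layout of the sample is reconstructed from the flattened transcript), and `parse_status` is a hypothetical helper, not part of the edg tools.

```python
import re
from typing import Dict

FIELDS = ("Current Status", "Exit code", "Status Reason", "Destination")


def parse_status(output: str) -> Dict[str, str]:
    """Extract the known 'Name: value' fields from edg-job-status style text."""
    found = {}
    for name in FIELDS:
        # [ \t]* (not \s*) so an empty value does not swallow the next line.
        match = re.search(
            rf"^[ \t]*{re.escape(name)}:[ \t]*(.*)$", output, flags=re.MULTILINE
        )
        if match:
            found[name] = match.group(1).strip()
    return found


# Sample text laid out one field per line; the layout is an assumption.
sample = """\
BOOKKEEPING INFORMATION:
Current Status: Done (Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: lcg00125.grid.sinica.edu.tw:2119/jobmanager-lcgpbs-dteam
"""
status = parse_status(sample)
```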

Getting the Output
  $ edg-job-get-output -i id.txt --dir $PWD
  Retrieving files from host: lcg00124.grid.sinica.edu.tw ( for )
  ****************************************************************************
  JOB GET OUTPUT OUTCOME
  Output sandbox files for the job:
  -
  have been successfully retrieved and stored in the directory:
  /home/hlshih/hlshih_QUMY4Dxg4TVVLvCaDDd2KA
  ****************************************************************************
  $ ls -l /home/hlshih/hlshih_QUMY4Dxg4TVVLvCaDDd2KA
  total 4
  -rw-r--r-- 1 hlshih hlshih 0 Mar 12 04:54 stderr
  -rw-r--r-- 1 hlshih hlshih 22 Mar 12 04:54 stdout
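The submit, status and get-output steps shown in these slides can be collected into small helpers that build the command lines. This is a sketch: the helper names are hypothetical, the option spellings `--vo` and `--dir` are an assumption (the transcript shows them with a single or typographic dash), and nothing is executed here.

```python
from typing import List


def submit_cmd(jdl_file: str, vo: str, id_file: str) -> List[str]:
    """Command line to submit a job and save its identifier to id_file."""
    return ["edg-job-submit", "-o", id_file, "--vo", vo, jdl_file]


def status_cmd(id_file: str) -> List[str]:
    """Command line to query the status of the job(s) listed in id_file."""
    return ["edg-job-status", "-i", id_file]


def get_output_cmd(id_file: str, out_dir: str) -> List[str]:
    """Command line to retrieve the OutputSandbox files into out_dir."""
    return ["edg-job-get-output", "-i", id_file, "--dir", out_dir]


# The three steps of the walkthrough, in order.
workflow = [
    submit_cmd("HelloWorld.jdl", "dteam", "id.txt"),
    status_cmd("id.txt"),
    get_output_cmd("id.txt", "."),
]
```

On a UI node with a valid proxy, each list could be passed to `subprocess.run(cmd, check=True)`; building the lists separately keeps the sketch usable where the commands are not installed.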
