David Colling Imperial College London Running your jobs everywhere.

What do you want from a Grid job submission system? Well, I cannot answer for you, but this is my guess: you want the Grid to be as easy to use as a conventional, local batch system. This means:
- Simple "qsub"-style commands
- All the data transparently available to the job
- You can monitor your jobs
You may want more, but these are the basics.

So if we have a whole range of local batch systems that can satisfy these criteria, why is doing this on the Grid so difficult? Some of the problems of a distributed computing system are:
- Not all data is distributed to every site
- You do not have computer accounts at every site (see Andrew's talk)
- Your jobs travel across the WAN, so additional security is required (see Andrew's talk)
- It is difficult to gather coherent information about the remote sites
- Everything (networks, computers, disks etc.) breaks

So how do we overcome these problems? This is the subject of this talk, and it is going to be a conceptual treatment. I am going to describe the solution developed by the European DataGrid project (EDG), now adopted by the LHC Computing Grid (LCG) and set to be the basis of the first EGEE release. There are several other Grid projects (e.g. see Rick's talk); however, they are conceptually very similar, although they do have important technical differences. These are my personal views.

The World as seen by the EDG. Each site consists of a compute element and a storage element. Sites are not identical:
- Different computers
- Different storage
- Different files
- Different usage policies
Without a Grid, this leaves a confused and unhappy user. So let's introduce some Grid infrastructure: security and an information system. Now the user knows what machines are out there and can communicate with them; however, where to submit the job is too complex a decision for the user alone. What is needed is an automated system: the Workload Management System (Resource Broker) together with the Replica Location Service (Replica Catalogue). The user runs edg-job-submit myjob.jdl, where myjob.jdl looks like:

  JobType = "Normal";
  Executable = "$(CMS)/exe/sum.exe";
  InputData = "LF:testbed ";
  ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog, dc=sunlab2g, dc=cnaf, dc=infn, dc=it";
  DataAccessProtocol = "gridftp";
  InputSandbox = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
  OutputSandbox = {"sim.err", "test.out", "sim.log"};
  Requirements = other.GlueHostOperatingSystemName == "linux" &&
                 other.GlueHostOperatingSystemRelease == "Red Hat 6.2" &&
                 other.GlueCEPolicyMaxWallClockTime > 10000;
  Rank = other.GlueCEStateFreeCPUs;

The job and its input sandbox go to the WMS, which uses the Replica Catalogue to decide on an execution location; the Logging & Bookkeeping service tracks the job, and the VO server handles membership. When the job finishes, edg-job-get-output retrieves the results. Now we have a happy user.

So does this system fulfil the requirements and overcome the problems? Security:
- The GSI security model, based on X.509, provides authentication
- Authorisation is via membership of virtual organisations (VOs) and group pool accounts
If well implemented this is secure (I am told), and it provides a way of authorising access to resources on which individuals do not have personal accounts.

So does this system fulfil the requirements and overcome the problems? Two different information systems have been tried within EDG/LCG.
Hierarchical LDAP-based system:
- Each site publishes a set of information about itself
- This was slow and didn't scale well
- Improvements came in later versions
R-GMA (Relational Grid Monitoring Architecture):
- Works on servlets
- Allows users to implement their own monitoring with their own executables
- Seems to scale
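The key idea behind R-GMA is that monitoring data from many sites is presented to consumers as if it were one relational database. A minimal sketch of that model, using an in-memory SQLite database as a stand-in (the table, columns and site records here are invented for illustration, not the real R-GMA schema):

```python
import sqlite3

# Stand-in for the "virtual database" that R-GMA presents to consumers.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE CEStatus (site TEXT, free_cpus INTEGER)")

# Each site acts as a producer, periodically publishing its state as tuples.
sites = [("bbq.mi.infn.it", 12), ("skurut.cesnet.cz", 3)]
db.executemany("INSERT INTO CEStatus VALUES (?, ?)", sites)

# A consumer (e.g. the WMS) issues an ordinary SQL query and gets a
# coherent picture of the whole Grid in one shot.
rows = db.execute(
    "SELECT site FROM CEStatus WHERE free_cpus > 0 ORDER BY free_cpus DESC"
).fetchall()
print([r[0] for r in rows])
```

The relational model is what lets users build their own monitoring: publishing a new table of attribute tuples is enough to make it queryable alongside everything else.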

So does this system fulfil the requirements and overcome the problems? It appears that we have the information system we need to be able to get a coherent picture of our Grid.

So does this system fulfil the requirements and overcome the problems? Not all the data are at every site:
- The Replica Location Service knows about all the physical copies of the data
- The user specifies a logical file name
- It can feed information into the WMS, and provides sufficient information for the user's job to be able to find the data it needs
- Users can also register their output files
- There are still some scaling issues
Ways of handling the data are being developed and work for reasonable numbers of files.
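Conceptually, the Replica Location Service is a mapping from one logical file name to its physical copies; the job names only the logical file, and the WMS (or the job itself) resolves it to a nearby replica. A sketch of that mapping, with invented file names and storage elements:

```python
# Invented example entries: a logical file name (LFN) maps to the
# physical replicas registered at different storage elements.
replica_catalogue = {
    "lfn:example_dataset_01": [
        "gridftp://se.bo.infn.it/data/example_dataset_01",
        "gridftp://se.gridpp.rl.ac.uk/data/example_dataset_01",
    ],
}

def locate(lfn):
    """Return all known physical replicas of a logical file name."""
    return replica_catalogue.get(lfn, [])

# The WMS can use this to prefer sites that already hold a replica,
# and the running job uses it to open whichever copy is closest.
copies = locate("lfn:example_dataset_01")
```

Registering an output file is then just adding a new entry to the same mapping, which is why the catalogue's scaling behaviour matters as file counts grow.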

So does this system fulfil the requirements and overcome the problems? At the heart of EDG/LCG is the WMS, which:
- Takes the job, along with its description in ClassAd format and its input sandbox, from the user
- Uses this description, information about the state of the resources, and the data location to decide on an execution location
- Submits the job to the selected resource
- Returns output to the user after the job has completed
It is built on Globus and CondorG as well as original code.
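The matchmaking step above can be sketched very simply: keep only the Computing Elements whose published attributes satisfy the job's Requirements expression, then order the survivors by the job's Rank. The CE records and attribute names below are invented illustrations, not the real Glue schema:

```python
# Invented CE records standing in for what the information system publishes.
ces = [
    {"id": "bbq.mi.infn.it:2119/jobmanager-pbs-dque",  "os": "linux",   "free_cpus": 8},
    {"id": "skurut.cesnet.cz:2119/jobmanager-pbs-wp1", "os": "linux",   "free_cpus": 2},
    {"id": "ce.example.org:2119/jobmanager-lsf-short", "os": "solaris", "free_cpus": 64},
]

def requirements(ce):
    # Stand-in for the job's Requirements expression: a hard constraint.
    return ce["os"] == "linux"

def rank(ce):
    # Stand-in for the job's Rank expression: a preference, not a constraint.
    return ce["free_cpus"]

# Filter by Requirements, then sort by Rank (highest first).
matches = sorted((ce for ce in ces if requirements(ce)), key=rank, reverse=True)
best = matches[0]["id"]  # the CE the job would actually be sent to
```

Note the division of labour: Requirements prunes the candidate list, Rank only orders what survives, which is exactly how the JDL attributes in the examples on these slides behave.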

So does this system fulfil the requirements and overcome the problems? Is it straightforward to use? You need to describe the job:

  #########################################
  #
  # ---- Sample Job Description File ----
  #
  #########################################
  JobType = "Normal";
  Executable = "sum.exe";
  StdInput = "data.in";
  InputSandbox = {"/home_firefox/fpacini/exe/sum.exe", "/home1/data.in"};
  OutputSandbox = {"data.out", "sum.err"};
  InputData = {"lfn:CARF_System.META.TestG4"};
  Rank = other.GlueCEPolicyMaxCPUTime;
  Requirements = other.GlueCEInfoLRMSType == "Condor" &&
                 other.GlueHostArchitecturePlatformType == "INTEL" &&
                 other.GlueHostOperatingSystemName == "LINUX" &&
                 other.GlueCEStateFreeCPUs >= 2;

So does this system fulfil the requirements and overcome the problems? Are the commands easy to use? Some typical commands:

  edg-job-list-match myjob.jdl
  ***************************************************************************
                     COMPUTING ELEMENT IDs LIST
  The following CE(s) matching your job requirements have been found:
                     *CEId*
  bbq.mi.infn.it:2119/jobmanager-pbs-dque
  skurut.cesnet.cz:2119/jobmanager-pbs-wp1
  ***************************************************************************

So does this system fulfil the requirements and overcome the problems?

  $> edg-job-submit -vo cms myjob1.jdl
  ================= edg-job-submit Success ==================================
  The job has been successfully submitted to the Network Server.
  Your job is identified by (edg_jobId):
  Use edg-job-status command to display current job status.
  ======================================================================

  $> edg-job-status -v 0
  *************************************************************
  BOOKKEEPING INFORMATION:
  Printing status info for the Job :
  Current Status: Scheduled
  Destination: bbq.mi.infn.it:2119/jobmanager-pbs-dque
  Status Reason: Job successfully submitted to Globus
  reached on: Tue May 6 16:14:
  *************************************************************

So does this system fulfil the requirements and overcome the problems? There are also:
- edg-job-cancel
- edg-job-get-logging-info
- etc.
The commands are as easy to use as those of other batch systems.

So does this system fulfil the requirements and overcome the problems? Robustness:
- Uses CondorG
- A job retries at a new site if it fails at the original (up to a number of times specified in the ClassAds)

David Colling Imperial College London However could still be inefficient (e.g. Job runs for hours or days before machine crashes) so introduced logical checkpointing.  Value Attribute pairs are periodically saved to the LB service  If job fails because of a CE problem it can restart from last saved state  Provides a natural way dividing up parameter scanning jobs Still not perfectly robust, but is getting there. So does this system fulfil the requirements and overcome the problems? The WMS is becoming robust

The EDG release satisfies our basic requirements (pretty much). However, it also has additional functionality. In the current release:
- Support for interactive jobs
- Support for MPI jobs
Implemented but not yet released:
- Dependent jobs (DAGs)
- A distributed accounting system based on Home Location Registers

DAGs

  A = [ Executable = "A.sh";
        PreScript = "PreA.sh";
        PreScriptArguments = { "1" };
        Children = { "B", "C" } ];
  B = [ Executable = "B.sh";
        PostScript = "PostA.sh";
        PostScriptArguments = { "$RETURN" };
        Children = { "D" } ];
  C = [ Executable = "C.sh";
        Children = { "D" } ];
  D = [ Executable = "D.sh";
        PreScript = "PreD.sh";
        PostScript = "PostD.sh";
        PostScriptArguments = { "1", "a" } ]
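The Children attributes above define a dependency graph: A must finish before B and C, and both must finish before D. A sketch of how such a DAG can be walked in dependency order (the node names follow the slide; the scheduling code is an illustration, not how the WMS is implemented):

```python
# Dependency graph from the slide's Children attributes.
children = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

# Count how many parents each node must wait for.
pending = {node: 0 for node in children}
for kids in children.values():
    for kid in kids:
        pending[kid] += 1

order = []
ready = [node for node, count in pending.items() if count == 0]  # just A
while ready:
    node = ready.pop(0)
    order.append(node)  # here the real system would submit this node's job
    for kid in children[node]:
        pending[kid] -= 1
        if pending[kid] == 0:   # all parents done: the child may now run
            ready.append(kid)
```

B and C have no dependency on each other, so once A completes they could run concurrently at different sites; only D must wait for both.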

Notes of caution: yes, the system works and is now pretty robust, but problems do still occur. Constant monitoring is required, or else site configurations seem to decay. This is a problem of many interacting pieces of software: problems at an individual site can go overlooked, as jobs are simply resubmitted elsewhere.

So what is there now? /~stuatw/applet/ (links from ) The EDG application testbed:
- More than 1000 CPUs
- 5 terabytes of storage
- EDG software installed at more than 40 sites
- 60k successful jobs since Oct 2003 (current release)

So what is there now? The LCG testbed (at the time of SC2003). So you really can submit your jobs around the world.

How to get started using the Grid: get a certificate, then sign the EDG and LCG usage rules (soon EDG will be replaced by EGEE). You will then become a member of a VO.

How to get started using the Grid: follow the examples in the user guides (G-Users-Guide-2.0.pdf). User support is currently limited, but will grow significantly over the next few months.

And in the future? The future will bring changes in the underlying technology, almost certainly based on Web Services. However, the functionality required will not change very much, and LCG and EGEE users should be shielded from these changes.

Summary: over the last three years the EDG has developed a working Grid that fulfils the basic user requirements. This has been adopted by LCG and EGEE, and we are approaching a production scientific service. The future may be based on new technology, but it will look similar to the user.