WP1 WMS rel. 2.0 Some issues Massimo Sgaravatto INFN Padova.



Outline
Some issues to discuss (and let’s try to decide):
- LB server choice
- New CondorG
- Proxy renewal
- RLS integration
- WP2 Optor integration
- Output data upload and registration
- LB issues
- Gangmatching
- Security of files on the WM node
- Disk quota management in WM node
- VOMS integration
- Job exit code
- ISB/OSB transfer errors
- Accounting integration
- User vs host proxies
- … ?

LB server choice
- Allow multiple LB servers for a single WM, for increased reliability and performance
- Approach
  - UI responsible for choosing the LB server (e.g. via round robin) ?
  - List of available LB servers in the UI conf file, until this VO-specific info is published in a “VO repository” (R-GMA/IS/VOMS) ?
  - Move the list of available NSs into this VO repository as well, when available
  - Not too clear yet what this VO repository could be (discussions within ATF)
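The round-robin choice mentioned above could look roughly like the following. This is only a minimal sketch: the server names and the `make_lb_chooser` helper are hypothetical, and the real UI would read the list from its conf file rather than from a constant.

```python
import itertools

# Hypothetical list; in the proposal above it would come from the UI conf file.
LB_SERVERS = ["lb1.example.org:9000", "lb2.example.org:9000", "lb3.example.org:9000"]

def make_lb_chooser(servers):
    """Return a function that hands out LB servers in round-robin order."""
    if not servers:
        raise ValueError("at least one LB server is required")
    cycle = itertools.cycle(servers)
    return lambda: next(cycle)

choose_lb = make_lb_chooser(LB_SERVERS)
first_four = [choose_lb() for _ in range(4)]  # wraps around after the third server
```

Note that the rotation here lives in memory only. A UI command that runs once per submission would need to persist the position (or pick at random) to get a similar spread.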

New CondorG
- New CondorG negotiated with Condor people (more details by Francesco P.)
  - Released by end of March, included in VDT, and to be used in rel 2.0
- Two proxies
  - X509UserProxy
    - One per job
  - X509ManagementProxy
    - One per user’s DN, or one “serving” n jobs for that user’s DN
    - A CondorG pair for a given X509ManagementProxy
- Details on the whole machinery to be discussed
  - Where is this user’s DN → X509ManagementProxy mapping kept and managed ?
  - Proxy renewal ?
  - …

Proxy renewal
- Necessary to have a “persistent” proxy renewal daemon (i.e. if it is restarted it shouldn’t lose control of the “managed” jobs, as happens now)
- Necessary to discuss and decide on various issues
  - Renewal of X509UserProxy
    - Done only if requested by the user (if MyProxyServer specified in the JDL ?) ?
    - No MyProxyServer in WM conf file anymore ?
  - And what about renewal of X509ManagementProxy ?
    - If a new proxy “arrives” from the UI and extends the validity of the existing one, does the new one replace the old one ?
    - Not enough: what if at least one job of that user asked for proxy renewal ? Then it is necessary to renew the X509ManagementProxy as well
  - Who does registration ? NS ?
  - Who does un-registration ??
  - …
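One way to read the “persistent” requirement is that the daemon keeps its list of managed jobs on disk and reloads it at startup. The sketch below illustrates only that idea. The `RenewalRegistry` class, its JSON file layout and the example job id, DN and server name are all assumptions, not the actual WP1 design, and no real proxy renewal is performed.

```python
import json
import os
import tempfile

class RenewalRegistry:
    """Keeps the jobs registered for proxy renewal in a JSON file (hypothetical layout)."""

    def __init__(self, path):
        self.path = path
        self.jobs = {}
        # Reload the previous state, if any, so a restart does not lose managed jobs.
        if os.path.exists(path):
            with open(path) as fh:
                self.jobs = json.load(fh)

    def _save(self):
        # Write to a temporary file and rename it, so a crash cannot leave a half-written registry.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump(self.jobs, fh)
        os.replace(tmp, self.path)

    def register(self, job_id, user_dn, myproxy_server):
        self.jobs[job_id] = {"dn": user_dn, "myproxy": myproxy_server}
        self._save()

    def unregister(self, job_id):
        self.jobs.pop(job_id, None)
        self._save()

state_file = os.path.join(tempfile.mkdtemp(), "renewal.json")
daemon = RenewalRegistry(state_file)
daemon.register("job-1", "/C=IT/O=INFN/CN=Some User", "myproxy.example.org")
restarted = RenewalRegistry(state_file)  # simulates a daemon restart
```

Who calls `register` and `unregister` (NS or another service) is exactly the open question raised above. The sketch takes no position on it.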

RLS integration
- At J+27 RB/MM will have to query the WP2 RLS instead of the WP2 RC to get the SFNs given an LFN (or an LCN, or a GUID)
- On-going negotiation of this WP1-WP2 interface
  - New JDL attribute (VirtualOrganization) to make it possible to refer to the “official” VO’s RLS (needed by WP2 services)
    - Not needed anymore when VOMS is integrated and it is therefore possible to get the VO from the user’s proxy
  - Optional JDL attribute to make it possible to specify a “non-official” RLS ?
  - edgReplicaManager::listReplicas to get the SFNs
  - New BrokerInfo content (under negotiation)

Integration with WP2 Optor
- Completely different approach than querying the RLS to get the PFNs (mutually exclusive) …
  - RB calls getAccessCost for all the suitable CEs (the ones where the user is authorized to submit jobs and which match the JDL “Requirements” expression) and for all the specified input data (LFNs, LCNs, GUIDs)
  - A “cost” is returned for each CE
  - The RB chooses the CE, taking into account this cost and also the other Ranks (how is still to be decided)
  - In some cases the WM also has to trigger the replication of files to the closeSE
- Not too difficult, but very high impact on scheduling/planning performed by RB/MM
- Integration WMS-Optor
  - Planned after J+27
  - However, according to WP2, this stuff will be ready and tested well before J+27
  - Details of integration to be discussed
    - How ? A binary flag in the WM conf file to enable/disable Optor ?
    - When ?
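How the access cost and the Rank are combined is explicitly left open above. Purely as an illustration, the sketch below assumes a linear combination (rank minus a weighted cost) and a boolean `use_optor` switch standing in for the proposed conf-file flag. The function name, the weighting and the sample numbers are all hypothetical, and the costs are plain inputs here rather than results of a real getAccessCost call.

```python
def choose_ce(ranks, access_costs, use_optor=True, cost_weight=1.0):
    """Pick a CE from those that already passed authorization and Requirements.

    ranks: dict CE id -> rank (higher is better).
    access_costs: dict CE id -> access cost (lower is better).
    The linear scoring is an assumption made for this sketch.
    """
    def score(ce):
        if not use_optor:
            return ranks[ce]  # Optor disabled: fall back to Rank only
        return ranks[ce] - cost_weight * access_costs[ce]
    return max(ranks, key=score)

ranks = {"ce-a": 10.0, "ce-b": 8.0}
costs = {"ce-a": 5.0, "ce-b": 1.0}
with_optor = choose_ce(ranks, costs)                      # ce-b scores 7.0, ce-a scores 5.0
without_optor = choose_ce(ranks, costs, use_optor=False)  # Rank alone favours ce-a
```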

Output data upload and registration
- Problem discussed and solution agreed in the ATF
- Approach (details by Fabrizio P.):
  - OutputData JDL attribute (optional) to specify output file names, output LFNs and output SEs
  - At the end, the JobWrapper has to call the WP2 function copyAndRegister
- Issues
  - Some details about copyAndRegister to be sorted out
  - Release date of this stuff not decided yet

LB
- What happens exactly at J+27 wrt:
  - “Advanced query to LB” ?
  - “LB – RGMA integration” ?
- How ?
- Interfaces (e.g. for advanced queries) ?
- Issues ?
- Ales ??

Gangmatching
- Problem: take into account both CE and SE information in the matchmaking
  - For example, to require a job to run on a CE close to an SE with “enough space”
- Salvo has been working on this for a while, also after some negotiations with the Condor team (A. Roy)
- See Salvo’s talk for details (e.g. JDL) and discussions
- When can this stuff be released ? J+27 ?
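The “CE close to an SE with enough space” example can be pictured as a join between CE and SE records. The sketch below is plain Python and does not use Condor ClassAd gangmatching syntax. The data layout (a CE to close-SE map and a free-space table) and all names are assumptions made only to show the idea of matching on both resources at once.

```python
def gangmatch(ces, se_free_gb, close_se, min_free_gb):
    """Return (CE, SE) pairs where the SE is close to the CE and has enough free space."""
    matches = []
    for ce in ces:
        for se in close_se.get(ce, []):
            if se_free_gb.get(se, 0) >= min_free_gb:
                matches.append((ce, se))
    return matches

ces = ["ce-a", "ce-b"]
se_free_gb = {"se-1": 50, "se-2": 500}
close_se = {"ce-a": ["se-1"], "ce-b": ["se-1", "se-2"]}
pairs = gangmatch(ces, se_free_gb, close_se, min_free_gb=100)
```

With these sample values only `ce-b` qualifies, because it is the only CE with a close SE offering at least 100 GB.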

Security of files on the WM node
- Approach
  - WP1 services (NS, …) running as edguser.edguser on the WM node
  - Different users’ subjects mapped to different local users in the grid-mapfile: user1.user, user2.user, …
  - Patched gridftp server (by Massimo M.) running on the NS node, so that the InputSandbox files transferred to the NS node have edguser as group and rwxrwx--- as mask
  - So a user cannot access files belonging to another user anymore
- Issues
  - When ? J+27 ?
  - How ? Gridftp server RPM released by WP1 ?

Disk quota management on the WM node
- Having different users’ DNs mapped to different local users in the grid-mapfile of the WM node makes it possible to set disk quotas for the various users
- NS to be modified (for J+27) so that it rejects a job if there is not enough disk quota available to store the input sandbox files
- Issues ?
- Marco ??
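The check the NS would have to make before accepting a job reduces to comparing the total sandbox size with the user's remaining quota. A minimal sketch, assuming the sizes and quota figures are already known in bytes (how the NS would obtain them from the local quota system is not covered, and `accept_job` is a hypothetical name):

```python
def accept_job(sandbox_sizes, quota_bytes, used_bytes):
    """Return True if the input sandbox fits in the user's remaining disk quota."""
    needed = sum(sandbox_sizes)
    remaining = quota_bytes - used_bytes
    return needed <= remaining

fits = accept_job([100, 200], quota_bytes=1000, used_bytes=600)      # 300 needed, 400 left
too_big = accept_job([100, 400], quota_bytes=1000, used_bytes=600)   # 500 needed, 400 left
```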

VOMS integration
- E.g.: voms-proxy-init -vo CMS
  - VO info in the generated proxy
- Impact on WP1 software
  - Retrieve VO from the user’s proxy
    - So not necessary to provide it anymore in the JDL for querying the RLS
  - Check for authorization not done anymore with a matchmaking considering the User Cert Subject, but according to the VO
  - Proxy used by the various services (NS, LB, etc.) generated by VOMS ?
- Issues
  - VOMS deployed at J+37, but not too clear which integration will take place and when
  - Not clear yet which VOMS APIs are available

Job exit code
- For release 2.0 we agreed to return the job exit code to the user with dg-job-status
- What if the exit code <> 0 ?
  - Done-ok in any case ?
  - Done-failed (and therefore resubmission) ?

ISB/OSB transfer errors
- In release 1.x a job is considered failed (and therefore resubmission is attempted) if the JobWrapper detects errors when transferring an ISB/OSB file between the RB node and the WN
  - But the failure could simply be due to a user’s error when writing the ISB/OSB expressions in the JDL …
  - And what if the job crashed for “internal” problems and therefore some OSB files were not produced ? Is it ok to mark the job as failed and re-attempt the submission, or is it better to consider the job as done-ok ?
- Approach in release 2.0
  - JobAdapter should check and issue globus-url-copy only for ISB-OSB files which exist (simple for OSB, a bit more complex for ISB)
  - and/or globus-url-copy errors ignored ?
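For the OSB side, “issue globus-url-copy only for files which exist” could be sketched as below. This is an illustration in Python, not the actual job wrapper code. The helper name, the destination URL and the file names are hypothetical, and the commands are only built, not executed.

```python
import os
import tempfile

def osb_copy_commands(workdir, osb_names, dest_url):
    """Build one globus-url-copy command per OSB file that actually exists.

    Returns (commands, missing): the argv lists to run, and the names that were skipped.
    """
    commands, missing = [], []
    for name in osb_names:
        path = os.path.join(workdir, name)
        if os.path.isfile(path):
            src = "file://" + os.path.abspath(path)
            dst = dest_url.rstrip("/") + "/" + name
            commands.append(["globus-url-copy", src, dst])
        else:
            missing.append(name)  # skipped instead of causing a transfer error
    return commands, missing

workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "out.txt"), "w") as fh:
    fh.write("result\n")

commands, missing = osb_copy_commands(
    workdir, ["out.txt", "core.dump"], "gsiftp://rb.example.org/sandbox/job-1"
)
```

Returning the skipped names separately leaves the policy question above open: the caller can still decide whether a missing OSB file means done-ok or failed.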

Accounting integration
- What exactly happens at J+27 (“Accounting infrastructure”) ?
- And later, after release 2.0 (“Full integration of cost estimation/accounting into scheduling policies”) ?
- Dependencies and interfaces with other components and other WPs at J+27 and later ?

Host vs user proxies
- Can we rely on users’ proxies instead of host proxies for authentication when possible, as recommended ?
  - E.g. in LB logging
  - Other cases ?