
Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova.


1 Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

2 Evaluation of GRAM Service GRAM CONDOR GRAM LSF GRAM PBS Site1 Site2Site3 Submit jobs (using Globus tools) GIS Information on characteristics and status of local resources

3 Evaluation of GRAM Service Job submission tests using Globus tools (globusrun, globus-job-run, globus-job-submit) GRAM as uniform interface to different underlying resource management systems “Cooperation” between GRAM and GIS Evaluation of RSL as uniform language to specify resources Tests performed with Globus 1.1.2 and 1.1.3 and Linux machines

4 GRAM & fork system call Client Server (fork) Globus

5 GRAM & Condor Client Server (Condor front-end machine) Globus Condor Condor pool

6 GRAM & Condor Tests considering: Standard Condor jobs (relinked with Condor library) INFN WAN Condor pool configured as Globus resource ~ 200 machines spread across different sites Heterogeneous environment No single file system and UID domain Vanilla jobs (“normal” jobs) PC farm configured as Globus resource Single file system and UID domain

7 GRAM & LSF Server (LSF front-end machine) Client Globus LSF Cluster

8 Results Some bugs found and fixed (fixes included in INFNGRID 1.1 distribution) Standard output and error for vanilla Condor jobs globus-job-status … Some bugs can be solved without major re-design and/or re-implementation: For LSF the RSL parameter (count=x) is translated into: bsub -n x … Just allocates x processors, and dispatches the job to the first one Used for parallel applications Should be: bsub … x times Maybe we don’t need to solve this problem (see later…) … Two major problems: Scalability Fault tolerance
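The two translations of (count=x) described on this slide can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the code just builds command lines without running them, and it is not the code of the Globus LSF scripts. Only `bsub` and its `-n` option come from the slide.

```python
from typing import List


def lsf_commands_current(executable: str, count: int) -> List[List[str]]:
    """Translation described on the slide: one bsub call with -n count."""
    # A single submission: LSF allocates `count` processors and dispatches
    # the job to the first one (semantics intended for parallel applications).
    return [["bsub", "-n", str(count), executable]]


def lsf_commands_expected(executable: str, count: int) -> List[List[str]]:
    """Translation the slide asks for: `count` independent bsub submissions."""
    return [["bsub", executable] for _ in range(count)]


if __name__ == "__main__":
    # Print both variants for a small count to compare them side by side.
    for argv in lsf_commands_current("/diskCms/startcmsim.sh", 3):
        print(" ".join(argv))
    for argv in lsf_commands_expected("/diskCms/startcmsim.sh", 3):
        print(" ".join(argv))
```

The first variant yields one command line, the second yields one command line per requested job, which is the difference the slide points out.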

9 Globus GRAM Architecture Client LSF/ Condor/ PBS/ … Globus front-end machine Jobmanager Job pc1% globusrun -b -r pc2.pd.infn.it/jobmanager-xyz \ -f file.rsl file.rsl: & (executable=/diskCms/startcmsim.sh) (stdin=/diskCms/PythiaOut/filename) (stdout=/diskCms/Cmsim/filename) (count=1) pc1 pc2

10 Scalability One jobmanager for each globusrun If I want to submit 1000 jobs ??? 1000 globusrun → 1000 jobmanagers running in the front-end machine !!! %globusrun -b -r pc2.infn.it/jobmanager-xyz -f file.rsl file.rsl: & (executable=/diskCms/startcmsim.sh) (stdin=/diskCms/PythiaOut/filename) (stdout=/diskCms/CmsimOut/filename) (count=1000) It is not possible to specify in the RSL file 1000 different input files and 1000 different output files … $(Process) in Condor Problems with job monitoring (globus-job-status) Therefore (count=x) with x>1 not very useful !
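Since a single RSL file with (count=1000) cannot name 1000 different input and output files, one workaround is to generate one RSL description per job on the client side, imitating what $(Process) does in Condor. The sketch below is a rough illustration under assumptions: the function names and the `.<index>` file suffix are hypothetical, and only the RSL attributes shown on the slide are used.

```python
from typing import List

# One RSL description per job, using only the attributes shown on the slide.
RSL_TEMPLATE = (
    "& (executable={executable})\n"
    "  (stdin={stdin_prefix}.{index})\n"
    "  (stdout={stdout_prefix}.{index})\n"
    "  (count=1)\n"
)


def render_rsl(executable: str, stdin_prefix: str, stdout_prefix: str, index: int) -> str:
    """Return the RSL text for job number `index` (hypothetical naming scheme)."""
    return RSL_TEMPLATE.format(
        executable=executable,
        stdin_prefix=stdin_prefix,
        stdout_prefix=stdout_prefix,
        index=index,
    )


def render_all(executable: str, stdin_prefix: str, stdout_prefix: str, n: int) -> List[str]:
    """Return n RSL descriptions, each with its own input and output file."""
    return [render_rsl(executable, stdin_prefix, stdout_prefix, i) for i in range(n)]


if __name__ == "__main__":
    jobs = render_all(
        "/diskCms/startcmsim.sh",
        "/diskCms/PythiaOut/filename",
        "/diskCms/CmsimOut/filename",
        3,
    )
    print(jobs[0])
```

Each generated description still has to be submitted with its own globusrun, so this does not remove the one-jobmanager-per-submission cost noted on the slide.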

11 Fault tolerance The jobmanager is not persistent If the jobmanager can’t be contacted, Globus assumes that the job(s) have been completed Example of problem Submission of n jobs on a cluster managed by a local resource management system Reboot of the front-end machine The jobmanager(s) don’t restart Orphan jobs → Globus assumes that the jobs have been successfully completed
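The failure mode on this slide can be modelled with a toy status function. This is not Globus code: the state names and both functions are hypothetical, and `None` stands for "the jobmanager could not be contacted".

```python
from enum import Enum
from typing import Optional


class JobState(Enum):
    # Illustrative state names, chosen for this sketch only.
    PENDING = "PENDING"
    ACTIVE = "ACTIVE"
    DONE = "DONE"
    UNKNOWN = "UNKNOWN"


def state_as_described(reported: Optional[JobState]) -> JobState:
    """Behaviour described on the slide: no answer is read as completion."""
    if reported is None:  # jobmanager unreachable, e.g. after a reboot
        return JobState.DONE
    return reported


def state_conservative(reported: Optional[JobState]) -> JobState:
    """Alternative for comparison: no answer means the state is unknown."""
    if reported is None:
        return JobState.UNKNOWN
    return reported


if __name__ == "__main__":
    # A job still running on the cluster whose jobmanager died with the front end.
    print(state_as_described(None).value)
    print(state_conservative(None).value)
```

With the first function an orphan job that is still running on the cluster is reported as finished, which is the problem the slide describes.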

12 GRAM & GIS How do the local GRAMs provide the GIS with characteristics and status of local resources? Tests performed considering: Condor pool LSF cluster

13 GRAM & Condor & GIS

14 GRAM & LSF & GIS Must be fixed

15 Jobs & GIS Info on Globus jobs published in the GIS: User Subject of certificate Local user name RSL string Globus job id LSF/Condor/… job id Status: Run/Pending/…

16 GRAM & GIS The information on characteristics and status of local resources and on jobs is not enough As local resources we must consider Farms and not the single workstations Other information (e.g. total and available CPU power) needed Fortunately the default schema can be integrated with other info provided by specific agents The needed information must be identified first

17 RSL We need a uniform language to specify resources, between different resource management systems The RSL syntax model seems suitable to define even complicated resource specification expressions The common set of RSL attributes is often not sufficient The attributes not belonging to the common set are ignored

18 RSL More flexibility is required Resource administrators should be allowed to define new attributes and users should be allowed to use them in resource specification expressions (Condor Class-Ads model) Same language to describe the offered resources and the requested resources (Condor Class-Ads model) seems a better approach
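The idea of using one language for both offered and requested resources can be illustrated with a toy matchmaker. This is a sketch in the spirit of the Condor Class-Ads model, not ClassAd syntax or Condor code: the dictionary layout, the attribute names and the `matches` function are all assumptions made for the example.

```python
from typing import Any, Callable, Dict

# Both a job and a machine are described by the same kind of record:
# free-form attributes plus an optional "requirements" predicate on the other side.
Ad = Dict[str, Any]


def _always_true(other: Ad) -> bool:
    return True


def matches(job: Ad, machine: Ad) -> bool:
    """Two ads match when each side's requirements accept the other ad."""
    job_req: Callable[[Ad], bool] = job.get("requirements", _always_true)
    machine_req: Callable[[Ad], bool] = machine.get("requirements", _always_true)
    return bool(job_req(machine)) and bool(machine_req(job))


if __name__ == "__main__":
    # "site" plays the role of an attribute defined by the resource administrator.
    machine = {
        "opsys": "LINUX",
        "memory_mb": 256,
        "site": "Padova",
        "requirements": lambda job: job.get("owner") == "cms",
    }
    job = {
        "owner": "cms",
        "requirements": lambda m: m.get("opsys") == "LINUX" and m.get("site") == "Padova",
    }
    print(matches(job, machine))
```

Because attributes are free-form, an administrator can publish a new attribute and users can reference it in their requirements without any change to the matching function, which is the flexibility the slide asks for.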

19 Next steps Bug fixes Modification of Globus LSF scripts for GIS Problem (count=x) with LSF ??? Tests with real applications and real environments (CMS fall production) Define a small set of attributes of a Condor pool, LSF cluster, PBS cluster that should be reported to the GIS, and try to implement it Let’s start with information provided by the underlying resource management system Tests with GRAM API Tests with other resource management systems not necessary Scalability and robustness problems Not so simple and straightforward !!! Up to Workload management WP, possible collaboration with Globus team and Condor team

20 Other info http://www.pd.infn.it/~sgaravat/INFN-GRID http://www.pd.infn.it/~sgaravat/INFN-GRID/Globus

21 GRAM & PBS (by F. Giacomini-INFN Cnaf) Client Server (PBS) Globus PBS Linux Server (4 processors)

22 Condor GlideIn Submission of Condor jobs on Globus resources Condor daemons (master, startd) run on Globus resources These resources temporarily become part of the Condor pool Usage of Condor-G to run Condor daemons Local resource management systems (LSF, PBS, …) of Globus resources used only to run Condor daemons For a cluster it is necessary to install Globus only on one front-end machine, while the Condor daemons will run on each workstation of the cluster

23 GlideIn pc3 Cluster managed by LSF/Condor/… Globus Personal Condor Globus pc1 pc2 pc1% condor_glidein pc2.pd.infn.it … pc1% condor_glidein pc3.pd.infn.it …

24 Condor GlideIn Usage of all Condor mechanisms and capabilities Robustness and fault tolerance Only “ready-to-use” solution if we want to use Globus tools Also Master functionalities (Condor matchmaking system) Viable solution if the goal is just to find idle CPUs The architecture must be integrated/modified if we have to take into account other parameters (e.g. location of input files)

25 Condor GlideIn GlideIn tested (considering standard and vanilla jobs) with: Workstation using the fork system call as job manager Seems to work Condor pool Seems to work Condor flocking better solution if authentication is not required LSF cluster Problems with glidein of multiple nodes with a single condor_glidein command (because of the problem related to the LSF parameter (count=x)) → Multiple condor_glidein commands → Problems of scalability (number of jobmanagers) → Modification of Globus scripts for LSF

