1 BIG FARMS AND THE GRID Job Submission and Monitoring issues ATF Meeting, 20/06/03 Sergio Andreozzi.



2 OUTLINE
Computing Resources in the Glue Schema
 - A CE is the access point to a queue (same as in previous EDG)
 - Only one access point is envisioned per queue
 - E.g. edt001.cnaf.infn.it:2119/jobmanager-pbs-short
Typical Cluster Configuration
Broker Service and Cluster Services
 - Description of the interaction
 - (goal: understand the submission process and how cluster services are stressed)
LCG Proposed Cluster Configuration
Job submission in this new scenario
Discussion
Proposal

3 Typical cluster layout
In DataGrid, the typical cluster configuration is:
 - One node running:
     Batch server (e.g. LSF, PBS)
     MDS GRIS
     Gatekeeper
 - Several worker nodes driven by the batch server for job execution

4 Typical cluster layout
[Diagram: a head node running the batch server, the gatekeeper and the GRIS, with a queue feeding several worker nodes.]

5 Typical job submission (partial description)
[Diagram: a Broker and an Information Index, plus two different clusters whose head nodes, edt001 and edt002, each run a batch server, a gatekeeper and a GRIS; the arrows are numbered 1 to 3 to match the steps below.]
1. From the II, the broker gets the list of CEs that the user can access and that match the JDL requirements
2. For each selected CE, the GRIS is contacted to get the parameters used in the JDL rank option, so that the CEs can be ordered
3. The first CE in the ordered list is used to run the job
(I don't remember if requirements are checked in step 1 or 2)
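The three steps above can be sketched roughly as follows. This is only an illustration, not the broker's actual code: the `CE` class, the `select_ce` function and the attribute name `FreeCPUs` are hypothetical, requirements are assumed to be checked in step 1, and a higher rank value is assumed to be better.

```python
from dataclasses import dataclass, field


@dataclass
class CE:
    """Hypothetical view of a CE as the broker might see it."""
    ce_id: str                                  # e.g. "edt001.cnaf.infn.it:2119/jobmanager-pbs-short"
    allowed_users: set                          # users authorised on this CE
    attrs: dict = field(default_factory=dict)   # values a GRIS would publish


def match_and_rank(ces, user, requirements, rank):
    # Step 1: keep the CEs the user can access and that satisfy the requirements
    # (assumption: requirements are evaluated here rather than in step 2).
    candidates = [ce for ce in ces if user in ce.allowed_users and requirements(ce)]
    # Step 2: evaluate the rank expression on each candidate's published values
    # and order the candidates, best first (assumption: higher rank is better).
    return sorted(candidates, key=rank, reverse=True)


def select_ce(ces, user, requirements, rank):
    # Step 3: the first CE in the ordered list runs the job.
    ranked = match_and_rank(ces, user, requirements, rank)
    return ranked[0] if ranked else None
```

A call such as `select_ce(ces, "alice", lambda ce: ce.attrs["FreeCPUs"] > 0, lambda ce: ce.attrs["FreeCPUs"])` would then return the accessible CE with the most free CPUs, or `None` if nothing matches.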

6 Cluster layout @ CERN
People at CERN assert:
 - The gatekeeper service can be heavily loaded when managing several job submissions
   e.g. a gatekeeper design issue: one process stays alive for each submitted job until the end of the computation
 - For scalability, they deploy the gatekeeper service on a different node than the batch server
 - They can have several nodes running a gatekeeper for the same batch server
 - They plan to set up a big farm of 400-1500 nodes with a set of O(10) access nodes and only one batch server
 - LSF can manage O(1000) nodes

7 Cluster layout @ CERN
[Diagram: a head node running the batch server, with a queue feeding several worker nodes; several access nodes, each running a gatekeeper and a GRIS.]

8 Example
[Diagram: a head node running the LSF batch server, with queues labelled sh and lo; access nodes edt001, edt002, ..., edt00n, each running a gatekeeper and a GRIS; a Broker and an Information Index.]
 CE:edt001:2119/sh
 CE:edt001:2119/lo
 CE:edt002:2119/sh
 CE:edt002:2119/lo
 CE:edt00n:2119/sh
 CE:edt00n:2119/lo
Each CE is a different queue for the broker
Load balancing is done among matching queues
Among replicated queues, the rank process always provides the same order
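To make the duplication concrete, the sketch below enumerates the CE identifiers the broker would see when every access node publishes every queue. The function name `published_ces` is hypothetical, and the `CE:host:port/queue` string format simply copies the labels on the slide.

```python
from itertools import product


def published_ces(access_nodes, queues, port=2119):
    """One CE entry per (access node, queue) pair, as in the example slide."""
    return [f"CE:{node}:{port}/{queue}" for node, queue in product(access_nodes, queues)]


ces = published_ces(["edt001", "edt002", "edt003"], ["sh", "lo"])
print(len(ces))  # 6: two real queues show up as six CEs
```

With n access nodes in front of q queues, the broker therefore handles n * q CEs although only q queues exist on the batch server.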

9 Cluster layout @ CERN: advantages and disadvantages
Advantages (+):
 - Can scale to a higher number of parallel job submissions
 - From the site manager's viewpoint, it provides great flexibility in managing/configuring the farm
Disadvantages (-):
 - Not envisioned in either the EDG schema or the Glue schema -> duplication of info
   E.g. a given queue/CE on a batch server will show up in the GIS as many times as the number of gatekeepers configured for the batch server managing the queue
 - The LRMS is stressed by several info providers asking for the same info; e.g. with 10 gatekeepers/GRISes for an LSF server, each refreshing its info every 30 s, the server gets 20 requests per minute instead of 2... this might be a problem
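The request-rate figure above is simple arithmetic; the small helper below (a hypothetical name, assuming each info provider issues one request to the LRMS per refresh) reproduces it.

```python
def lrms_requests_per_minute(info_providers, refresh_period_s):
    """Requests hitting the LRMS per minute, one request per provider per refresh."""
    return info_providers * 60 / refresh_period_s


print(lrms_requests_per_minute(10, 30))  # 20.0 with ten gatekeeper/GRIS nodes
print(lrms_requests_per_minute(1, 30))   # 2.0 with a single GRIS
```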

10 Discussion
How does the broker deal with this scenario?
What changes are needed to support it?
 - GIS schema
     Introducing the concept of Access Point; a queue can have several Access Points
     Defining a quality parameter for an access point (so that the broker can choose the least loaded one)
 - Broker service
 - Monitoring service

11 GRIS AND GATEKEEPER SERVICES
The important question is:
 - DO WE NEED TO REPLICATE THE GRIS AS WELL?
The gatekeeper does not need the GRIS
The GRIS at the moment needs some info from the gatekeeper
 - e.g. hostname, port, access to the gridmap file
If we are not missing anything else, they can easily be decoupled onto different machines

12 PROPOSAL
Decouple the CE Unique ID from the entry point
The CE ID should be just a globally unique ID for the queue
 - E.g. /jobmanager- -
Introduce a new attribute GlueCEAccessPoint
 - E.g. = current GlueCEUniqueID
ONE GRIS PER BATCH SERVER

13 Proposed scenario
[Diagram: a head node running the batch server, with a queue feeding several worker nodes; several access nodes, each running only a gatekeeper; a single GRIS.]
The GRIS can run on an access node, on the head node or on another machine
ONLY ONE INSTANCE

14 EXAMPLE CE REPRESENTATION WITHIN MODIFIED GLUE SCHEMA FOR LDAP
dn: GlueCEUniqueID=edt001.cnaf.infn.it/jobmanager-pbs-short, Mds-Vo-Name=local, o=grid
...
GlueCEUniqueID: edt001.cnaf.infn.it/jobmanager-pbs-short
GlueCEAccessPoint: edt002.cnaf.infn.it:2119/jobmanager-pbs-short
GlueCEAccessPoint: edt003.cnaf.infn.it:2119/jobmanager-pbs-short
GlueCEAccessPoint: edt004.cnaf.infn.it:2119/jobmanager-pbs-short
GlueCEStateFreeCPUs: 5
GlueCEPolicyMaxRunningJobs: 10
GlueCEAccessControlBaseRule: …
…
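As an illustration of how a client could read such an entry, the sketch below collects the multi-valued GlueCEAccessPoint attribute. It is a deliberately minimal `name: value` parser written for this example, not a full LDIF parser (no line folding or base64 values); `parse_entry` is a hypothetical name, and the attributes elided on the slide are simply left out of the sample text.

```python
from collections import defaultdict

ENTRY = """\
dn: GlueCEUniqueID=edt001.cnaf.infn.it/jobmanager-pbs-short, Mds-Vo-Name=local, o=grid
GlueCEUniqueID: edt001.cnaf.infn.it/jobmanager-pbs-short
GlueCEAccessPoint: edt002.cnaf.infn.it:2119/jobmanager-pbs-short
GlueCEAccessPoint: edt003.cnaf.infn.it:2119/jobmanager-pbs-short
GlueCEAccessPoint: edt004.cnaf.infn.it:2119/jobmanager-pbs-short
GlueCEStateFreeCPUs: 5
GlueCEPolicyMaxRunningJobs: 10
"""


def parse_entry(text):
    """Map each attribute name to the list of its values, in order of appearance."""
    attrs = defaultdict(list)
    for line in text.splitlines():
        # Split on the first colon only, so host:port values stay intact.
        name, sep, value = line.partition(":")
        if sep and name.strip():
            attrs[name.strip()].append(value.strip())
    return dict(attrs)


entry = parse_entry(ENTRY)
for access_point in entry["GlueCEAccessPoint"]:
    print(access_point)
```

The queue is described once (one GlueCEUniqueID, one set of state and policy values), while the list under GlueCEAccessPoint carries the gatekeepers that can be used to reach it.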

15 Broker modification
When querying the GRIS, the broker will keep track of the several Access Points
Once it selects the queue, it will submit to a random gatekeeper
Further improvement*:
 GlueCEAccessPoint: edt004.cnaf.infn.it:2119/jobmanager-pbs-short:
 where CE_AP_LOAD is a metric that lets the broker rank the access points
BENEFITS:
 - The broker will do less work during the matchmaking process (no duplicated GRIS to be queried to get info for ranking on EACH JOB SUBMISSION)
 - The broker will choose a random gatekeeper among those listed, as it does now (so no worse behaviour); this can be improved*
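The two selection policies on this slide could look roughly like the sketch below. The function `pick_access_point` and the `loads` dictionary are hypothetical; the slide does not say how CE_AP_LOAD would be encoded or measured, so the sketch assumes one number per access point where a lower value means a less loaded gatekeeper.

```python
import random


def pick_access_point(access_points, loads=None, rng=random):
    """Choose the gatekeeper to submit to once the queue has been selected."""
    if not access_points:
        raise ValueError("queue has no access points")
    if loads is None:
        # Proposed baseline behaviour: any listed gatekeeper, chosen at random.
        return rng.choice(access_points)
    # Further improvement: rank by the (assumed) CE_AP_LOAD value, lowest first.
    # Access points without a published load are treated as the worst choice.
    return min(access_points, key=lambda ap: loads.get(ap, float("inf")))


aps = [
    "edt002.cnaf.infn.it:2119/jobmanager-pbs-short",
    "edt003.cnaf.infn.it:2119/jobmanager-pbs-short",
    "edt004.cnaf.infn.it:2119/jobmanager-pbs-short",
]
print(pick_access_point(aps))                              # one of the three, at random
print(pick_access_point(aps, {aps[0]: 0.9, aps[1]: 0.1}))  # edt003, the least loaded
```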

16 Monitoring
No need to deal with replicated info
Able to show the real number of queues, their state and access points
Detailed host info can be aggregated and presented as access node loads

