CERN LCG Overview & Scaling challenges
David Smith, for the LCG Deployment Group, CERN
HEPiX 2003, Vancouver

CERN David.Smith@cern.ch
Introduction
LCG overview:
– Contains software for:
    Workload Management
    Data Management
    Information System/Monitoring Services
    Storage Management
Scaling concerns:
– Workload management is built on top of the Globus Toolkit, version 2.2.4.
– Only Workload Management issues are discussed here, specifically those related to Globus.

The problem
LCG is not a development project, but the underlying Globus toolkit has some scaling characteristics that affect us:
– Jobs are managed by a ‘JobManager’.
– The interface between the JobManager and the local batch system is provided by an intermediate layer (a Perl script).
Two broad problem areas:
– The JobManager scripts assume access to a file system area shared between all batch workers.
– Inherent scaling problems arise because one JobManager instance is associated with each user job.

[Diagram labels: GRAM Client, Job Control, Globus Gatekeeper, Globus Job Manager, Job Manager script, Local Batch system]

Shared file system
A shared file system is not acceptable to everyone.
We wanted to remove the requirement of sharing between batch workers.
The submission model is that of a central entity, the Globus gatekeeper, which receives and handles job queries and submission to the local batch system.
The shared file system requirement comes from the need to make:
– the X509 certificate available to the job
– stdout and stderr available to the gatekeeper after (or during) job execution.

Steps in submitting a job
The gatekeeper proceeds through a number of states:
– Stage in the files required by the job
– Copy the X509 user proxy
– Submit the job to the batch system
– Monitor the job status in the batch system
– Receive a refreshed X509 proxy during the lifetime of the job
– Allow access to the stdout/stderr of the job
– Optionally return output files
– Clean up and free the resources held for the job
[Diagram labels: Gatekeeper, Worker1, Worker2]

The GASS cache
Globus uses the GASS cache: a file-system-based database that allows an instance of a file to be associated with a URL and a TAG.
Globus provides IO routines to access both local and remote GASS cache entries.
How can we avoid sharing the file system that contains the GASS cache?
– Export the job's entries at the start of the job and create a local cache on the target batch worker for the life of the job.
– At the end of the job, return the contents of the local cache and add them back to the cache the gatekeeper is working with.
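The export and re-import cycle described on this slide can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `ToyCache`, its methods and the example URLs are hypothetical names invented for illustration, and nothing here uses the real Globus GASS cache routines or its on-disk layout.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCache:
    """Toy stand-in for a GASS cache: maps (url, tag) to file contents."""
    entries: dict = field(default_factory=dict)

    def add(self, url, tag, data):
        self.entries[(url, tag)] = data

    def export_tag(self, tag):
        """Return a new cache holding only the entries carrying this tag."""
        subset = {key: data for key, data in self.entries.items() if key[1] == tag}
        return ToyCache(subset)

    def import_from(self, other):
        """Merge another cache's entries back in (the imported copy wins)."""
        self.entries.update(other.entries)


# Hypothetical URLs, used only as keys in this toy model.
PROXY_URL = "https://gatekeeper.example/proxy"
STDOUT_URL = "https://gatekeeper.example/stdout"

gatekeeper = ToyCache()
gatekeeper.add(PROXY_URL, "job1", b"proxy-for-job1")
gatekeeper.add(PROXY_URL, "job2", b"proxy-for-job2")

# Start of job1: the worker receives only job1's entries.
worker1 = gatekeeper.export_tag("job1")
# During the job, output is written into the worker's local cache.
worker1.add(STDOUT_URL, "job1", b"hello\n")
# End of job1: the local cache is merged back into the gatekeeper's cache.
gatekeeper.import_from(worker1)
```

The point of the sketch is that each worker only ever sees the entries for its own job, so no file system has to be shared between the gatekeeper and the batch workers.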

Exporting the GASS cache
[Diagram: GASS caches. The cache on the gatekeeper holds entries for Job1, Job2, Job3, …; Worker1 holds a local cache for Job1, Worker2 a local cache for Job2, Worker3 a local cache for Job3.]

More on cache handling
Exporting and importing are done using globus-url-copy (FTP).
Special considerations:
An initial X509 certificate is required to start the import of the cache.
– Use the stage-in facility of the batch system. (For PBS this implies scp access from the batch worker to the gatekeeper machine.)
The X509 proxy certificate needs to be updated during the life of a job.
– Pull the proxy from the gatekeeper cache when the proxy on the batch worker is near expiry.
Stdout and stderr from the job are returned as entries in the cache.
– The local batch system will also have a mechanism to return these. If it is used, the two are concatenated.
The Globus mechanism for staging files in and out also needs explicit copying of the staged-in file set from the gatekeeper, and the return of the files to be staged out.
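The "pull the proxy when it is near expiry" rule can be sketched as a simple time comparison. The 30-minute margin, the function names and the `fetch_from_gatekeeper` callback are assumptions made for illustration; the slides do not say what threshold is used or how the pull is triggered.

```python
from datetime import datetime, timedelta, timezone

# Assumed safety margin; the talk does not state the actual value.
REFRESH_MARGIN = timedelta(minutes=30)


def needs_refresh(expires_at, now, margin=REFRESH_MARGIN):
    """True when the local proxy expires within the margin (or already has)."""
    return expires_at - now <= margin


def maybe_pull_proxy(local_expiry, now, fetch_from_gatekeeper):
    """Return the expiry time of the proxy the worker holds after this check.

    fetch_from_gatekeeper is a hypothetical callback that pulls the refreshed
    proxy from the gatekeeper cache and returns its expiry time.
    """
    if needs_refresh(local_expiry, now):
        return fetch_from_gatekeeper()
    return local_expiry


now = datetime(2003, 10, 20, 12, 0, tzinfo=timezone.utc)
refreshed_expiry = now + timedelta(hours=12)

# A proxy with 10 minutes left is replaced; one with 6 hours left is kept.
soon = maybe_pull_proxy(now + timedelta(minutes=10), now, lambda: refreshed_expiry)
later = maybe_pull_proxy(now + timedelta(hours=6), now, lambda: refreshed_expiry)
```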

JobManager handling
The other problem…
A JobManager is associated with each job.
– By default there is a JobManager in the process table on the gatekeeper machine for each Globus job submitted or running. This is:
    Limited by the number of processes
    Limited by the memory available
    Limited by scheduling or other system resources
– Each JobManager also needs to query the state of its job periodically.
    It needs to start a JobManager script and use batch system commands.
There is already a solution for this…
– Condor-G is already used by LCG as the job submission service.
– The Condor team have an interesting way to address the number of JobManagers.

Condor-G solution
The Condor-G solution is to make use of an existing GRAM facility:
– Once a job is submitted to the local batch system, the associated JobManager can be signaled to exit.
– This is not much use in itself, as the JobManager must be restarted in order to query the job’s status in the batch system.
Condor-G can run a special ‘grid monitor’ task on the gatekeeper machine, on behalf of each user:
– This calls the JobManager script interface to query the status of each job in turn from the batch system. The status list for all of the jobs is returned periodically to the Condor-G machine.
– For jobs that have left the batch system, a JobManager is restarted and the final stages of the job are concluded as normal.
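One polling pass of a per-user monitor of this kind might look like the sketch below. It is not the Condor-G grid monitor itself: `poll_once`, the status strings and the `fake_batch` table are hypothetical, and the `query_status` callback stands in for a call to the JobManager script interface.

```python
def poll_once(job_ids, query_status):
    """Query each job in turn.

    Returns the status list for all jobs and the subset of jobs that have
    left the batch system, for which a JobManager would be restarted.
    """
    statuses = {}
    to_restart = []
    for job_id in job_ids:
        status = query_status(job_id)
        statuses[job_id] = status
        if status == "DONE":
            to_restart.append(job_id)
    return statuses, to_restart


# Hypothetical batch system state: job3 is no longer known to the batch
# system, which this sketch treats as "DONE".
fake_batch = {"job1": "RUNNING", "job2": "PENDING"}

statuses, to_restart = poll_once(
    ["job1", "job2", "job3"],
    lambda job_id: fake_batch.get(job_id, "DONE"),
)
```

A single process per user replaces one long-lived JobManager per job; only the jobs in `to_restart` need a JobManager again.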

The Condor-G grid monitor 1
For a given user…
[Diagram: Job1, Job2 and Job3 each pass through Stage in, Submit to bs (batch system), Poll status, Stage out, Cleanup. Labels: JobManager, Manager killed, Grid monitor for user, Condor-G machine.]

Grid monitor 2
[Diagram labels: Query Job 1 (JM script to poll Job 1), Query Job 2 (JM script to poll Job 2), Return results, Parse list of jobs, Wait for next poll]

Remaining issues
There are still some problems:
The load on the batch system is potentially large: a series of queries every check interval.
For a large number of jobs, the check time can far exceed the check interval.
The system is tightly coupled to the batch system.
– Slow responses to queries or submission requests can rapidly cause the gatekeeper to become process bound, or prevent the grid monitor from returning any results to Condor-G.

Total query time
The problem of total query time is partially addressed by an optimisation in the grid manager:
Query only as many jobs as is possible in one scan period.
– Assume that the others have not changed state since the last query.
– In the next cycle, start with the jobs whose status is oldest.
[Diagram labels: Parse list of jobs, Parse list of changed jobs, Query Job 3 (JM script to poll Job 3), Query Job 4 (JM script to poll Job 4), Query Job 1 (JM script to poll Job 1)]
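The "oldest status first, bounded by the scan period" rule can be sketched as follows. The function name, the fixed per-query cost and the numbers in the example are assumptions for illustration; the slides do not say how the actual optimisation estimates how many jobs fit into a scan.

```python
def select_jobs_to_query(last_checked, scan_period, cost_per_query):
    """Pick the jobs with the oldest status that fit into one scan period.

    last_checked maps job id -> time of its last status query (seconds).
    cost_per_query is an assumed, fixed time per batch system query.
    """
    budget = int(scan_period // cost_per_query)
    oldest_first = sorted(last_checked, key=lambda job: last_checked[job])
    return oldest_first[:budget]


# Illustrative numbers only: 4 jobs, a 60 s scan period, 25 s per query,
# so only 2 jobs can be queried per cycle.
last_checked = {"job1": 100.0, "job2": 40.0, "job3": 10.0, "job4": 70.0}

first = select_jobs_to_query(last_checked, scan_period=60.0, cost_per_query=25.0)
for job in first:
    last_checked[job] = 160.0  # these jobs now have the freshest status

second = select_jobs_to_query(last_checked, scan_period=60.0, cost_per_query=25.0)
```

Jobs skipped in one cycle keep their previous status and move to the front of the next cycle, so every job is eventually queried even when a full scan does not fit into one period.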

Changes to JobManager scripts
Address the coupling to the batch system, and the batch system load, through the JobManager scripts themselves:
– Globus supplies JobManager interfaces to Condor, LSF and PBS.
– We wrote LCG versions of these.
New job managers for import/export of the GASS cache to batch workers.
Architecture change to address the remaining batch system issues:
– Batch load is reduced by caching the batch system query (45 second cache).
– Coupling to the batch system is loosened by introducing queues at various stages of the job cycle.
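A time-limited cache in front of the batch system query could be sketched like this. Only the 45 second lifetime comes from the slide; the class name, the snapshot format and the injectable clock are assumptions for illustration and do not describe the LCG JobManager scripts, which are not shown in the talk.

```python
import time


class CachedBatchStatus:
    """Serve job status from a snapshot that is refreshed at most every max_age seconds."""

    def __init__(self, run_query, max_age=45.0, clock=time.monotonic):
        self._run_query = run_query    # callable returning {job_id: status}
        self._max_age = max_age        # 45 s, as quoted on the slide
        self._clock = clock            # injectable so the example needs no sleep
        self._snapshot = None
        self._fetched_at = None

    def status_of(self, job_id):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._max_age:
            self._snapshot = self._run_query()
            self._fetched_at = now
        return self._snapshot.get(job_id, "UNKNOWN")


calls = []


def fake_batch_query():
    """Hypothetical stand-in for one expensive batch system query."""
    calls.append(1)
    return {"job1": "RUNNING", "job2": "QUEUED"}


fake_time = [0.0]
cache = CachedBatchStatus(fake_batch_query, clock=lambda: fake_time[0])

cache.status_of("job1")   # first call queries the batch system
cache.status_of("job2")   # served from the snapshot
fake_time[0] = 46.0
cache.status_of("job1")   # snapshot is stale, so the batch system is queried again
```

With a cache like this, the number of batch system queries depends on elapsed time rather than on the number of jobs being polled.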

Job progression through LCG JM
Add queues, serviced by asynchronous processes.
[Diagram: Job1 passes through Stage in, Submit to bs (batch system), Poll status, Stage out, Cleanup. Labels: Export and Submission queue, Cleanup queue, Import queue, Grid monitor for user, Batch status cache, Queue & Cache Service processes.]
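The idea of decoupling job stages with queues that are drained by asynchronous service processes can be sketched with threads and in-process queues. This is a simplification under stated assumptions: the queue names follow the slide, but `serve`, `make_handler` and the direct chaining of the three queues are hypothetical. In the scheme on the slide a job would presumably reach the import queue only once polling shows it has left the batch system.

```python
import queue
import threading


def serve(inbox, handle, outbox=None):
    """Take jobs from inbox, handle them, and optionally pass them on."""
    while True:
        job = inbox.get()
        if job is None:               # sentinel: shut this service down
            if outbox is not None:
                outbox.put(None)      # propagate shutdown to the next stage
            break
        handle(job)
        if outbox is not None:
            outbox.put(job)


log = []


def make_handler(stage_name):
    """Build a handler that only records which stage processed which job."""
    def handle(job):
        log.append((job, stage_name))
    return handle


submit_q, import_q, cleanup_q = queue.Queue(), queue.Queue(), queue.Queue()
stages = [
    (submit_q, "export+submit", import_q),
    (import_q, "import", cleanup_q),
    (cleanup_q, "cleanup", None),
]
threads = [
    threading.Thread(target=serve, args=(inbox, make_handler(name), outbox))
    for inbox, name, outbox in stages
]
for thread in threads:
    thread.start()

# The submitter only enqueues work; it never waits on the batch system.
for job in ["job1", "job2", "job3"]:
    submit_q.put(job)
submit_q.put(None)

for thread in threads:
    thread.join()
```

Because each stage only reads from its own queue, a slow stage delays the jobs waiting in that queue without blocking the component that accepted them.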

Summary
The Globus toolkit has scaling limitations in the job submission model.
Condor-G already has an interesting solution.
– Optimisation is possible.
LCG flavour of the JobManager scripts:
– Avoid the necessity of sharing the gatekeeper GASS cache with the batch worker machines.
– Loosen the binding to the batch system.
– Reduce the batch system query frequency.

Future
The work so far should allow 1000s of jobs to be handled by a gatekeeper.
– However, little work has yet been done on scaling with the number of users.
In the future it may be worth considering changes to the Globus JobManager itself.
– Both the grid monitor and the LCG JobManagers are trying to deal with issues related to the underlying JobManager and GRAM design.