Presentation is loading. Please wait.

Presentation is loading. Please wait.

Workload Management David Colling Imperial College London.

Similar presentations


Presentation on theme: "Workload Management David Colling Imperial College London."— Presentation transcript:

1 Workload Management David Colling Imperial College London

2 Release 2 is not based on release 1 Whole new architecture (pretty much described in D1.4) More modular I have little practical experience of this new architecture (yet).

3 So what is the new architecture? See D1.4 for details…

4 The architecture User Interface: Although there have been several changes to the architecture, the commands available at the user end are (almost) the same… now edg-job-submit etc Also now apis Network Server The Network Server is a generic network daemon, responsible for accepting incoming requests from the UI (e.g. job submission, job removal), which, if valid, are then passed to the Workload Manager.

5 The architecture Workload manager: The Workload Manager is the core component of the Workload Management System. Given a valid request, it has to take the appropriate actions to satisfy it. To do so, it may need support from other components, which are specific to the different request types.

6 The architecture Resource Broker: This has been turned into one of the modules that help the workload manager, actually 3 sub- modules… Matchmaking Ranking Scheduling Job Adapter The Job Adapter put the finishing touches to the jobs jdl and creates the job wrapper.

7 The architecture Job Controller and CondoG Actually submit the job to the resources and track progress. So how does this all work…

8 Job submission example (for a simple job) UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status

9 Job submission UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status edg-job-submit myjob.jdl Myjob.jdl JobType = Normal; Executable = "$(CMS)/exe/sum.exe"; InputData = "LF:testbed "; ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog,dc=sunlab2g, dc=cnaf, dc=infn, dc=it"; DataAccessProtocol = "gridftp"; InputSandbox = {"/home/user/WP1testC","/home/file*, "/home/user/DATA/*"}; OutputSandbox = {sim.err, test.out, sim.log"}; Requirements = other. GlueHostOperatingSystemName == linux" && other. GlueHostOperatingSystemRelease == "Red Hat 6.2 && other.GlueCEPolicyMaxWallClockTime > 10000; Rank = other.GlueCEStateFreeCPUs; submitted Job Status UI: allows users to access the functionalities of the WMS Job Description Language (JDL) to specify job characteristics and requirements

10 UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status RB storage Input Sandbox files Job waiting submitted Job Status NS: network daemon responsible for accepting incoming requests Job submission

11 UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status RB storage waiting submitted Job Status WM: responsible to take the appropriate actions to satisfy the request Job Job submission

12 UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status RB storage waiting submitted Job Status Match- maker Where does this job must be executed ? Job submission

13 UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status RB storage waiting submitted Job Status Match- Maker/ Broker Matchmaker: responsible to find the best CE where to submit a job Job submission

14 UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status RB storage waiting submitted Job Status Match- Maker/ Broker Where are (which SEs) the needed data ? What is the status of the Grid ? Job submission

15 UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status RB storage waiting submitted Job Status Match- maker CE choice Job submission

16 UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status RB storage waiting submitted Job Status Job Adapter JA: responsible for the final touches to the job before performing submission (e.g. creation of wrapper script, etc.) Job submission

17 UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status RB storage Job Status JC: responsible for the actual job management operations (done via CondorG) Job submitted waiting ready Job submission

18 UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node CE characts & status SE characts & status RB storage Job Status Job Input Sandbox files submitted waiting ready scheduled Job submission

19 UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node RB storage Job Status Input Sandbox submitted waiting ready scheduled running Grid enabled data transfers/ accesses Job Job submission

20 UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node RB storage Job Status Output Sandbox files submitted waiting ready scheduled running done Job submission

21 UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node RB storage Job Status Output Sandbox submitted waiting ready scheduled running done edg-job-get-output Job submission

22 UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog Inform. Service Computing Element Storage Element RB node RB storage Job Status Output Sandbox files submitted waiting ready scheduled running done cleared Job submission

23 UI Log Monitor Logging & Bookkeeping Network Server Job Contr. - CondorG Workload Manager Computing Element RB node LM: parses CondorG log file (where CondorG logs info about jobs) and notifies LB LB: receives and stores job events; processes corresponding job status Log of job events edg-job-status Job status Logging and bookkeeping.

24 New functionality… Release 2 of WP 1 software New functionality includes: MPI job submission User APIs Accounting infrastructure (Management have decided not to deploy this for testbed 2) Interactive job support Job logical checkpointing

25 New functionality… All these are implemented… Specify which sort of job using the JobType classad e.g. JobType = Checkpointable However only tested on the WP 1 testbed as yet… Dont have time to go through all of these so will just will just go through checkpointing.

26 Job checkpointing scenario UI Network Server Job Contr. - CondorG Workload Manager RB node Computing Element X Computing Element Y Logging & Bookkeeping Server

27 UI Network Server Job Contr. - CondorG Workload Manager Replica Catalog RB node submitted Job Statu s UI: allows users to access the functionalities of the WMS edg-job-submit jobchkpt.jdl jobchkpt.jdl [JobType = Checkpointable; Executable = "hsum.exe"; StdOutput = Outfile; InputSandbox = "/home/user/hsum.exe, OutputSandbox = Outfile, Requirements = member("ROOT", other.GlueHostApplicationSoftwareRunTimeEnvironment) && member("CHKPT", other.GlueHostApplicationSoftwareRunTimeEnvironment); Rank = -other.GlueCEStateEstimatedResponseTime;] Computing Element X Computing Element Y Logging & Bookkeeping Server Job Description Language (JDL) to specify job characteristics and requirements

28 UI Network Server Job Contr. - CondorG Workload Manager RB node RB storage Job Status submitted waiting ready scheduled running Computing Element X Computing Element Y Logging & Bookkeeping Server Job Input Sandbox files Job Input Sandbox files Job Match- maker Job Adapter

29 UI Network Server Job Contr. - CondorG Workload Manager RB node RB storage Job Status submitted waiting ready scheduled running Computing Element X Computing Element Y Logging & Bookkeeping Server Job From time to time users job asks to save the intermediate state … ; State.saveValue(var1, value1>; … State.saveValue(varn, valuen); State.saveState(); …

30 UI Network Server Job Contr. - CondorG Workload Manager RB node RB storage Job Status submitted waiting ready scheduled running Logging & Bookkeeping Server Saving of job state Saving of intermediate files Computing Element X Computing Element Y Job

31 UI Network Server Job Contr. - CondorG Workload Manager RB node RB storage Job Status submitted waiting ready scheduled running done (failed ) Computing Element X Computing Element Y Logging & Bookkeeping Server Job Job fails (e.g. for a CE problem)

32 UI Network Server Job Contr. - CondorG Workload Manager RB node RB storage Job Status Match- maker Reschedule and resubmit job submitted waiting ready scheduled running done (failed ) waiting Computing Element X Computing Element Y Logging & Bookkeeping Server Job Where must this job be executed ? Possibly on a different CE where the job was previously submitted …

33 UI Network Server Job Contr. - CondorG Workload Manager RB node RB storage Job Status Match- maker CE choice: CEy submitted waiting ready scheduled running done (failed ) waiting Computing Element X Computing Element Y Logging & Bookkeeping Server

34 UI Network Server Job Contr. - CondorG Workload Manager RB node CE characts & status RB storage Job Status Job Adapter Computing Element X Computing Element Y Logging & Bookkeeping Server Job ready scheduled running done (failed ) waiting ready

35 Computing Element X Computing Element Y UI Network Server Job Contr. - CondorG Workload Manager RB node RB storage Job Status Job Input Sandbox files ready scheduled running done (failed ) waiting ready scheduled Logging & Bookkeeping Server Job

36 UI Network Server Job Contr. - CondorG Workload Manager RB node RB storage Job Status Retrieval of last saved state when job starts Retrieval of intermediate files (previously saved) scheduled running done (failed ) waiting ready scheduled running Computing Element X Computing Element Y Logging & Bookkeeping Server Job

37 UI Network Server Job Contr. - CondorG Workload Manager RB node RB storage Job Status Job scheduled running done (failed ) waiting ready scheduled running Computing Element X Computing Element Y Logging & Bookkeeping Server Job Job keeps running starting from the point corresponding to the retrieved state (doesnt need to start from the beginning)

38 Further additional functionality The order of implementation is not up to WP 1 people… Dependent jobs: Using Condor DAGMan For example…

39 A = [ Executable = "A.sh"; PreScript = "PreA.sh"; PreScriptArguments = { "1" }; Children = { "B", "C" } ]; B = [ Executable = "B.sh"; PostScript = "PostA.sh"; PostScriptArguments = { "$RETURN" }; Children = { "D" } ]; C = [ Executable = "C.sh"; Children = { "D" } ]; D = [ Executable = "D.sh"; PreScript = "PreD.sh"; PostScript = "PostD.sh"; PostScriptArguments = { "1", "a" } ] Further additional functionality

40 Job partitioning will be similar to checkpointing, with the jobs being partitioned according to some variable. Partitioned jobs will also have a pre-job and aggregator e.g. Further additional functionality

41 JobType = Partitionable; Executable =...; JobSteps =...; StepWeight =...; Requirements =...;... Prejob = [ Executable =... Requirements =...;... Aggregator = [ Executable =... Requirements =...;... ]; Further additional functionality

42 Also planned is advanced reservation of resources and co-location. Much more monitoring and performance quantification…

43 Summary New architecture has been implemented Lots of new functionality … but not stress tested Further functionality and performance quantification implemented by testbed 3.

44 Further into the future… EDG will not use OGSA, however the future is in the OGSA grid world. Work is being done at LeSC (See Steven Newhouses talk tomorrow) to wrap the WP 1 components. Communication via JDML and LBML Virtualisation of RB through OGSA factory Use virtualisation to load balance Increase interoperability


Download ppt "Workload Management David Colling Imperial College London."

Similar presentations


Ads by Google