Presentation on theme: "Maui", HepSysMan, 1st/2nd July 2004. Presentation transcript:
Batch System
[Diagram: a batch server (pbsserver) holding the batch server and cluster configuration, the job queue and the state table; several execution hosts (pbsmom) exchanging job start, stop and status messages with it; the user commands qsub, qdel and qstat; and a scheduler plug-in, with its own scheduler and additional cluster configuration, exchanging node and job start, stop and status information.]
Maui scheduler
Seems to originate at the Maui High Performance Computing Centre (MHPCC), http://www.mhpcc.edu
But now available from http://www.supercluster.org/maui/ in Covered Bridge Canyon, Utah
Maui/PBS Integration

    [martin@masternode martin]$ qmgr
    Max open servers: 4
    Qmgr: list server
    Server masternode
        server_state = Idle
        scheduling = False
        default_queue = dque
        log_events = 127
        mail_from = adm
        query_other_jobs = True
        resources_default.walltime = 00:01:00
        scheduler_iteration = 60
        node_pack = False
        pbs_version = OpenPBS_2.4

    # maui.cfg 3.2
    #
    # 18/5/04 built by maui with extras added by xCAT and the 12Mar04 version
    #
    SERVERHOST masternode
    # primary admin must be first in list
    ADMIN1 root
    RMCFG[base] TYPE=PBS
    RMPOLLINTERVAL 00:01:00
    SERVERPORT 42559
    SERVERMODE NORMAL
Maui Philosophy (1)
Maui is particularly concerned about scheduling multiprocessor jobs
How do you arrange a matching set of processors to be simultaneously available for a single job?
Maui tries to plan the execution of such jobs at a particular time when it expects sufficient processors to be available, on the basis of the jobs' maximum walltime parameters.
It establishes reservations on a set of processors for a job – ensuring all the processors are free at the planned time
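The planning step on this slide can be illustrated with a minimal sketch. It is not Maui's actual algorithm: the function name `earliest_start` and the data layout are hypothetical, and it assumes every running job uses its full maximum walltime, so the result is a worst-case estimate of when enough processors become free.

```python
def earliest_start(running, total_procs, procs_needed):
    """Earliest time (seconds from now) at which procs_needed processors are free.

    running: list of (procs_in_use, remaining_walltime_seconds) for running jobs.
    Assumption: each job runs until its maximum walltime expires.
    """
    if procs_needed > total_procs:
        raise ValueError("job asks for more processors than the cluster has")
    free = total_procs - sum(procs for procs, _ in running)
    if free >= procs_needed:
        return 0  # enough processors are idle right now
    # Release processors in order of expected completion time.
    for procs, remaining in sorted(running, key=lambda job: job[1]):
        free += procs
        if free >= procs_needed:
            return remaining
    raise RuntimeError("inconsistent input: running jobs exceed total_procs")


# An 8-processor cluster, fully busy; a new job needs 6 processors.
jobs = [(2, 600), (4, 1800), (2, 3600)]
print(earliest_start(jobs, total_procs=8, procs_needed=6))  # prints 1800
```

A reservation would then be placed on the processors released by that time, so that nothing started in the meantime can overrun into the planned slot.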
Reservations
[Diagram with axes walltime and cpu: jobs 12340, 12341, 12343 and 12344, reservations for job 12345 on their processors, and job 12345 itself.]
Maui Philosophy (2)
As the reservations take effect, more and more processors become idle as the planned job time approaches
A scheme called backfill tries to exploit these idle processors by running short single/few-processor jobs out of priority order in the gaps
Maximum efficiency is achieved by scheduling big jobs first and running small jobs in the gaps! Perhaps not what the users really want?
Maui really cares about walltimes
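A rough sketch of the backfill idea follows, under stated assumptions: a simple first-fit pass over the queue in priority order, with a single gap described by a number of idle processors and the time left before the reservation starts. The function name `backfill` and the tuple layout are hypothetical, and the real scheduler considers far more than this.

```python
def backfill(idle_procs, window_seconds, queue):
    """Pick jobs that fit on the idle processors and finish inside the gap.

    queue: list of (job_id, procs, walltime_seconds) in priority order.
    A job qualifies only if its requested walltime fits the window, which is
    why accurate walltime requests matter so much.
    """
    chosen = []
    for job_id, procs, walltime in queue:
        if procs <= idle_procs and walltime <= window_seconds:
            chosen.append(job_id)
            idle_procs -= procs  # those processors are now taken
    return chosen


# 3 idle processors, 1 hour until the reservation takes effect.
queue = [
    ("a", 4, 600),    # needs too many processors
    ("b", 1, 7200),   # would overrun the reservation
    ("c", 2, 1800),
    ("d", 1, 3000),
    ("e", 1, 600),    # no processors left by the time it is considered
]
print(backfill(3, 3600, queue))  # prints ['c', 'd']
```

Jobs "c" and "d" run ahead of the higher-priority "a" and "b", which is the out-of-priority-order behaviour the slide describes.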
Job Priority (1)
Jobs are selected for execution in priority order
Priority is calculated as a linear combination of factors based on
–Credentials – who, class/queue, ...
–Fair Share
–Resources requested
–Waiting time
–Target Service level – e.g. maximum wait
Most sites would have most coefficients set to 0
Sample Priority Component
Fairshare (FS) Component
Fairshare components allow a site to favor jobs based on short term historical usage. The Fairshare Overview describes the configuration and use of Fairshare in detail.
After the brief reprieve from complexity found in the QOS factor, we come to the Fairshare factor. This factor is used to adjust a job's priority based on the historical percentage system utilization of the job's user, group, account, or QOS. This allows you to 'steer' the workload toward a particular usage mix across user, group, account, and QOS dimensions. The fairshare priority factor calculation is

    Priority += FSWEIGHT * MIN(FSCAP, (
        FSUSERWEIGHT    * DeltaUserFSUsage +
        FSGROUPWEIGHT   * DeltaGroupFSUsage +
        FSACCOUNTWEIGHT * DeltaAccountFSUsage +
        FSQOSWEIGHT     * DeltaQOSFSUsage +
        FSCLASSWEIGHT   * DeltaClassFSUsage))

All '*WEIGHT' parameters above are specified on a per partition basis in the maui.cfg file. The 'Delta*Usage' components represent the difference in actual fairshare usage from a fairshare usage target. Actual fairshare usage is determined based on historical usage over the timeframe specified in the fairshare configuration. The target usage can be either a target, floor, or ceiling value as specified in the fairshare config file. The fairshare documentation covers this in detail but an example should help obfuscate things completely. Consider the following information associated with calculating the fairshare factor for job X.
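The formula quoted above can be transcribed directly. The sketch below is only an illustration: the function name `fairshare_component`, the dictionary keys and all the numbers are invented for this example (they are not the manual's job X figures), and it assumes each delta is computed as target minus actual usage, so that usage below target raises priority.

```python
DIMENSIONS = ("user", "group", "account", "qos", "class")


def fairshare_component(fs_weight, fs_cap, weights, deltas):
    """FSWEIGHT * MIN(FSCAP, sum of per-dimension weight * delta usage).

    weights, deltas: dicts keyed by dimension name; missing keys count as 0.
    Assumption: delta = target usage - actual usage (in percent).
    """
    total = sum(weights.get(d, 0) * deltas.get(d, 0) for d in DIMENSIONS)
    return fs_weight * min(fs_cap, total)


# Hypothetical job: its user is 5% over target, its group 10% under target.
weights = {"user": 2, "group": 5}
deltas = {"user": 25 - 30, "group": 50 - 40}

print(fairshare_component(1, 100, weights, deltas))  # prints 40
print(fairshare_component(1, 30, weights, deltas))   # prints 30 (capped)
```

Setting a weight to 0, or leaving it out, removes that dimension from the calculation, which matches the remark that most sites leave most coefficients at 0.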
Job Priority (2)
Multiple queues/classes are but one factor in Maui's calculations and decisions
Jobs are normally given a whole cpu or even a whole execution host
Priorities are recalculated on every Maui iteration – say once per minute
Jobs selected for backfill can bypass higher priority jobs
Fairness
Jobs can be given priority increments or decrements according to whether their user's/group's/... recent usage is below or above the target fairshare
There is a selection of throttling parameters to prevent various forms of excessive behaviour – max jobs, max submission rate, ...
Reservations
The administrator can set manual reservations – handy for shutting a node down at a particular time
Standing reservations repeat – e.g. ScotGRID-Glasgow reserves a few nodes for short jobs 08:00 – 20:00 every day.
–Backfill allows jobs of up to 12 hours on these nodes during the night
Node selection
Some heterogeneity in the cluster may require all processors for a job to come from some subset for best performance, e.g. sharing a Myrinet switch.
Some constraints on node selection based on ownership may be demanded
Maui has additional cluster configuration settings that can define sets of execution hosts as partitions (simple member list) or as nodesets (set defined by common node feature)
Simulation
Maui has a scheme for recording a usage profile over some period – e.g. a week
The profile can then be played back with a different Maui configuration in simulation mode to test new settings
Quite a few "under construction" sections in the manual about this
Resource Allocation Manager
Payment for usage
Maui can interwork with the QBank resource allocation manager
–http://www.emsl.pnl.gov/docs/mscf/qbank/ Pacific Northwest National Laboratory (PNNL) in Richland, Washington
–Reserves payment before job (lien) and takes actual payment for resources used after the job
May be important when cluster is funded from many sources and value for money needs to be proved
ScotGRID-Glasgow Experience (1)
–OpenPBS and Maui built and configured by IBM's eXtreme Cluster Administration Tool (xCAT) http://www.xcat.org
xCAT is not a product – more a kit of parts supplied to IBM customers to operate Linux clusters – some Open Source
xCAT includes scripts to build OpenPBS and Maui according to the xCAT scheme
–Fairshares used to balance between user groups
Calculated with respect to an average over 7 days – decaying 20% per day
Most effective with a steady demand across all users/groups – less good when job submission comes in peaks and troughs
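To make the "7 days, decaying 20% per day" setting concrete, here is a small sketch of a decayed usage share. The exact weighting Maui applies is an assumption here: each day's usage is multiplied by 0.8 raised to its age in days, and the group's weighted usage is divided by the cluster's weighted total. The function name `decayed_share` is hypothetical.

```python
def decayed_share(group_usage, total_usage, decay=0.8):
    """Fraction of recent cluster usage attributable to one group.

    group_usage, total_usage: per-day usage, index 0 = most recent day.
    decay=0.8 corresponds to usage losing 20% of its weight per day.
    """
    weights = [decay ** age for age in range(len(total_usage))]
    weighted_total = sum(w * t for w, t in zip(weights, total_usage))
    if weighted_total == 0:
        return 0.0  # no usage recorded in the window
    weighted_group = sum(w * g for w, g in zip(weights, group_usage))
    return weighted_group / weighted_total


# Same amount of work, done yesterday versus a week ago.
total = [10] * 7
recent = decayed_share([10, 0, 0, 0, 0, 0, 0], total)
old = decayed_share([0, 0, 0, 0, 0, 0, 10], total)
print(recent > old)  # prints True: recent usage counts for more
```

This also hints at why bursty submission works less well: a group that ran heavily a few days ago and then went quiet sees its share fade quickly, so the comparison against target swings more than it would under steady demand.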
ScotGRID-Glasgow Experience (2)
Standing reservation for short jobs during daytime
–Currently 3 nodes with a maximum walltime of 1 hour
–Intended for development/test runs
–Grid monitoring test jobs
No experience yet of multiprocessor jobs, simulation, resource allocation management
Bioinformatics group demonstrated that Maui has a compiled-in limit of 4096 on the maximum number of jobs that can be in the queue!
ScotGRID-Glasgow Experience (3)
Maui documentation is extensive but not completely comprehensive
Maui is not keen on error messages
Priority calculation is hard to get to grips with
A misbehaving pbs_mom hangs both OpenPBS and Maui
–ssh allnodes service pbs status
–hope to use ganglia ( http://ganglia.sourceforge.net/ ) to spot cases where a whole execution host is in trouble
Ganglia's gmetad (which aggregates local data) contributes a load average of ~1 on our 1 GHz PIII. Looks like gmetad needs its own cpu
Grid (1)
The EDG (and LCG?) job submission system relies on sites giving an estimate of the time before a job would start to execute – this assumes FIFO behaviour
Maui does not execute jobs in submission order – non-FIFO behaviour
So the RB gets an unreliable estimate
Grid (2)
GridPP have a batch solution replacing OpenPBS with Torque and Maui – see words of Steve Traylen at
–http://www.gridpp.ac.uk/tb-support/faq/torque.html
A Google search on "Maui lcg rpm" reveals many other sites getting into Maui