POZNAŃ SUPERCOMPUTING AND NETWORKING
Queueing Systems Configuration vs. Resource Management
Mirosław Kupczyk email@example.com

Agenda
- LSF
- NQE

LSF - overview
- Manage Networked Resources
- Run Jobs
- Manage Applications
- Control Access to System Resources
- Resource and Job Accounting
- Fault Tolerance
- Support for Heterogeneous Systems
- Checkpointing and Migration
- Parallel Processing

LSF Suite Products
- LSF Batch
- LSF JobScheduler
- LSF Analyzer
- LSF Parallel
- LSF MultiCluster
- LSF Make

LSF Architecture

Structure of LSF Batch

Concepts
Jobs:
- job ID
- job name
- task or interactive task
- interactive batch job
- job report
- job output
- job errors
- place a job
- dispatch a job
- job states:
  PEND: Waiting to be scheduled
  RUN: Running
  DONE: Finished with zero exit value
  EXIT: Finished with non-zero exit value
  PSUSP: Suspended while pending
  SSUSP: Suspended by LSF
  USUSP: Suspended by user
  POST_DONE: Post-processing completed without errors
  POST_ERR: Post-processing completed with errors

Pending Jobs
A job remains pending until all conditions for its execution are met. Some of the conditions are:
- Start time specified by the user when the job is submitted
- Load conditions on qualified hosts
- Dispatch windows during which the queue can dispatch and qualified hosts can accept jobs
- Run windows during which jobs from the queue can run
- Limits on the number of job slots configured for a queue, a host, or a user
- Relative priority to other users and jobs
- Availability of the specified resources
- Job dependency and pre-execution conditions

Abnormal Termination of Jobs
An abnormally terminated job goes into the EXIT state. A job terminates abnormally when:
- The job is cancelled by its owner or the LSF administrator while pending, or after being dispatched to a host.
- The job cannot be dispatched before it reaches its termination deadline, and is therefore aborted by LSF.
- The job fails to start successfully. For example, the user specified the wrong executable when submitting the job.
- The job exits with a non-zero exit status.

Concepts (contd.)
Hosts:
- host types
- local and remote hosts
- master host or LSF master
- submission and execution hosts
- server host or LSF server
- client host or LSF client

Concept: Queues
Queues represent a set of pending jobs, lined up in a defined order and waiting for their opportunity to use LSF resources. Queues implement different job scheduling and control policies. All jobs submitted to the same queue share the same scheduling and control policy. Queues do not correspond to individual hosts; each queue can use all server hosts in the cluster, or a configured subset of the server hosts.

Queue: example
Begin Queue
QUEUE_NAME = normal
PRIORITY = 30
NICE = 20
STACKLIMIT = 2048
DESCRIPTION = For normal low priority jobs, running only if hosts are lightly loaded.
QJOB_LIMIT = 60 # job limit of the queue
PJOB_LIMIT = 2 # job limit per processor
ut = 0.2
io = 50/240
CPULIMIT = 180/hostA # 3 hours of hostA
USERS = all
HOSTS = all
End Queue

Clusters
Load sharing in LSF is based on clusters. A cluster is a collection of hosts running LSF. Hosts are configured centrally and managed from any machine in the LSF cluster. A cluster can contain a mixture of host types. By putting all host types into a single cluster, you get easy access to the resources available on all of them. Clusters are normally set up based on administrative boundaries.
LSF clusters work best when each user has an account on all hosts in the cluster, and user files are shared among the hosts so that they can be accessed from any host. This way LSF can send a job to any host, and you need not worry about whether the job will be able to access the correct files.
LSF can also run batch jobs when files are not shared among the hosts. LSF includes facilities to copy files to and from the host where the batch job is run, so your data will always be in the right place.

Clusters contd.
A cluster is a group of hosts that provide shared computing resources. Hosts can be grouped into clusters in a number of ways. A cluster could contain:
- All the hosts in a single administrative group
- All the hosts on one file server or sub-network
- Hosts that perform similar functions
If you have hosts of more than one type, it is often convenient to group them together in the same cluster. LSF allows you to use these hosts transparently, so applications that run on only one host type are available to the entire cluster.

- first-come, first-served (FCFS): The default type of scheduling in LSF. Jobs are considered for dispatch based on their order in the queue.
- job slot: A job slot is a bucket into which a single unit of work is assigned in the LSF system. Hosts are configured to have a number of job slots available, and queues dispatch jobs to fill job slots.

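The interplay of FCFS order and job slots can be illustrated with a small sketch. It is not LSF code: the function name `dispatch_fcfs`, the host names, and the slot counts are assumptions made for illustration, and it assumes every job needs exactly one slot and ignores load, windows, and limits.

```python
from collections import deque


def dispatch_fcfs(pending, free_slots):
    """Fill free job slots with jobs taken strictly in queue order.

    pending    -- job names in submission order (hypothetical data)
    free_slots -- dict mapping host name to its number of free job slots
    Returns (placements, still_pending).
    """
    queue = deque(pending)
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    placements = []
    while queue:
        # Pick the first host that still has a free slot.
        host = next((h for h, n in slots.items() if n > 0), None)
        if host is None:
            break  # no free slot anywhere: remaining jobs stay pending
        job = queue.popleft()
        slots[host] -= 1
        placements.append((job, host))
    return placements, list(queue)


placements, waiting = dispatch_fcfs(
    ["job1", "job2", "job3", "job4"], {"hostA": 2, "hostB": 1}
)
print(placements)
print(waiting)
```

With three slots and four jobs, the first three jobs are placed in submission order and the fourth remains pending until a slot is freed.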
LSF Daemons
LIM (Load Information Manager) runs on each LSF server, monitors its host's load, and forwards load information to the master LIM. LIM collects 11 built-in load indices.
Master LIM: on one host in the cluster, the LIM acts as the master. The master LIM is chosen from among all the LIMs running in the cluster and runs on the master host. It stores the load data collected by the LIMs on all hosts, forwards load information to MBD, and provides that information to applications. If the master LIM becomes unavailable, a LIM on another host automatically takes over the role of master.
External LIMs are site-definable to collect up to 256 different resources.
RES (Remote Execution Server) runs on each LSF server, accepts remote execution requests, and provides fast, transparent and secure remote execution of interactive jobs. RES executes jobs and tasks in the background as the job owner. RES is similar to rshd (Remote Shell Daemon).

LSF Daemons contd.
SBD (Slave Batch Daemon) runs on each LSF server, receives job requests from MBD, and starts the jobs using RES. SBD is responsible for enforcing local LSF policies and maintaining the state of jobs on the machine.
MBD (Master Batch Daemon) receives job requests from LSF clients and servers and applies scheduling policies to dispatch the jobs to LSF servers in the cluster. MBD is responsible for the overall state of all jobs in the batch system. MBD keeps a file of all transactions performed on jobs throughout their lifecycle. MBD manages queues and schedules jobs on all hosts in the LSF cluster. Each cluster has one MBD, which runs on the master host.
PIM (Process Information Manager) runs on each LSF server and is responsible for monitoring all jobs and every process created for the jobs running on the server. PIM periodically walks the process tree and accumulates memory and CPU use data, which is reported to SBD. PIM provides run time resource use for all LSF jobs.

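The process-tree walk attributed to PIM can be sketched as follows. This is only an illustration of the idea, not PIM's implementation: the process table `procs`, its field names (`ppid`, `cpu`, `rss_kb`), and the function `job_usage` are hypothetical.

```python
def job_usage(procs, root_pid):
    """Sum CPU time and resident memory over a job's whole process tree.

    procs    -- dict: pid -> {"ppid": parent pid, "cpu": seconds, "rss_kb": kB}
    root_pid -- pid of the job's top-level process
    Returns (total_cpu_seconds, total_rss_kb).
    """
    # Build a parent -> children index once, then walk it iteratively.
    children = {}
    for pid, info in procs.items():
        children.setdefault(info["ppid"], []).append(pid)

    total_cpu = 0.0
    total_rss = 0
    stack = [root_pid]
    while stack:
        pid = stack.pop()
        total_cpu += procs[pid]["cpu"]
        total_rss += procs[pid]["rss_kb"]
        stack.extend(children.get(pid, []))
    return total_cpu, total_rss


# Hypothetical snapshot: job rooted at pid 100 with a child and a grandchild,
# plus an unrelated process (pid 200) that must not be counted.
procs = {
    100: {"ppid": 1, "cpu": 2.0, "rss_kb": 1000},
    101: {"ppid": 100, "cpu": 1.5, "rss_kb": 500},
    102: {"ppid": 101, "cpu": 0.5, "rss_kb": 250},
    200: {"ppid": 1, "cpu": 9.0, "rss_kb": 9999},
}
print(job_usage(procs, 100))
```

Repeating such a walk at intervals would give the per-job totals that are then compared against resource limits.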
How LSF works
1. Receive the job. Create a job file. Return the job ID to the user.
2. During the next dispatch turn, consider the job for dispatch.
3. Place the job on the best available host.
4. Set the environment on the host.
5. Start the job.

Job Submission
The job must be submitted to a queue.
How Automatic Queue Selection Works: LSF uses the following criteria to select a suitable queue:
- User access restriction. Queues that do not allow this user to submit jobs are not considered.
- Host restriction. If the job explicitly specifies a list of hosts on which the job can be run, then the selected queue must be configured to send jobs to all hosts in the list.
- Queue status. Closed queues are not considered.
- Exclusive execution restriction. If the job requires exclusive execution, then queues that are not configured to accept exclusive jobs are not considered.
- Job's requested resources. These must be within the resource limits of the selected queue.
If multiple queues satisfy the above requirements, then the first queue listed in the candidate queues (as defined by DEFAULT_QUEUE or LSB_DEFAULTQUEUE) that satisfies the requirements is selected.

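The selection criteria above amount to a filter over an ordered candidate list. The sketch below is a simplified model under stated assumptions: the `Queue` and `Job` classes, their fields, and the single `cpu_limit` resource are hypothetical stand-ins, not LSF data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    name: str
    users: set                       # allowed users; {"all"} means everyone
    hosts: set                       # hosts the queue can send jobs to
    is_open: bool = True
    accepts_exclusive: bool = False
    cpu_limit: float = float("inf")  # one illustrative resource limit


@dataclass
class Job:
    user: str
    hosts: set = field(default_factory=set)  # explicitly requested hosts
    exclusive: bool = False
    cpu_request: float = 0.0


def select_queue(job, candidates):
    """Return the name of the first candidate queue that passes every check."""
    for q in candidates:
        if "all" not in q.users and job.user not in q.users:
            continue  # user access restriction
        if not job.hosts <= q.hosts:
            continue  # queue must cover every host the job asked for
        if not q.is_open:
            continue  # closed queues are not considered
        if job.exclusive and not q.accepts_exclusive:
            continue  # exclusive execution restriction
        if job.cpu_request > q.cpu_limit:
            continue  # request must fit within the queue's limit
        return q.name
    return None


# Candidate queues in the order they would be listed as defaults.
candidates = [
    Queue("short", {"all"}, {"hostA"}, cpu_limit=10),
    Queue("night", {"all"}, {"hostA", "hostB"}, is_open=False),
    Queue("normal", {"alice"}, {"hostA", "hostB"}, cpu_limit=180),
]
print(select_queue(Job("alice", {"hostB"}, cpu_request=60), candidates))
```

Because the loop returns on the first match, the order of the candidate list decides between queues that all qualify.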
Host Selection
A number of conditions determine whether a host is eligible:
- Host dispatch windows
- Resource requirements of the job
- Resource requirements of the queue
- Host list of the queue
- Host load levels
- Job slot limits of the host

Job Dispatch
When a job is submitted to LSF, many factors control when and where the job starts to run:
- Active time window of the queue or hosts
- Resource requirements of the job
- Availability of eligible hosts
- Various job slot limits
- Job dependency conditions
- Fairshare constraints
- Load conditions

Fairshare Scheduling
Fairshare scheduling divides the processing power of the LSF cluster among users and groups to provide fair access to resources.
By default, LSF considers jobs for dispatch in the same order as they appear in the queue. If your cluster has many users competing for limited resources, the FCFS policy might not be enough. For example, one user could submit many long jobs at once and monopolize the cluster's resources for a long time, while other users submit urgent jobs that must wait in queues until all of the first user's jobs are done. To prevent this, use fairshare scheduling to control how resources should be shared by competing users.
Fairshare is not necessarily equal share: you can assign a higher priority to the most important users. If there are two users competing for resources, you can:
- Give all the resources to the most important user
- Share the resources so the most important user gets the most resources
- Share the resources so that all users have equal importance

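A toy model shows how share-based ordering differs from FCFS. It is deliberately simplified and is not LSF's actual priority formula: here a user's priority is assumed to be shares divided by (1 + running jobs), and the names `dispatch_order`, `alice`, `bob`, `key`, and `other` are made up for the example.

```python
def dispatch_order(shares, pending, slots):
    """Return the users whose jobs are started, in order, for `slots` free slots.

    shares  -- dict: user -> number of shares
    pending -- dict: user -> number of pending jobs (iteration order breaks ties)
    slots   -- number of free job slots to fill
    """
    pending = dict(pending)            # copy: do not mutate the caller's data
    running = {user: 0 for user in pending}
    order = []
    for _ in range(slots):
        waiting = [u for u in pending if pending[u] > 0]
        if not waiting:
            break
        # Assumed priority: more shares raise it, more running jobs lower it.
        user = max(waiting, key=lambda u: shares[u] / (1 + running[u]))
        pending[user] -= 1
        running[user] += 1
        order.append(user)
    return order


# alice floods the queue first; with equal shares bob still gets every other slot.
equal = dispatch_order({"alice": 1, "bob": 1}, {"alice": 10, "bob": 2}, 4)
# With a 3:1 share ratio the important user gets three of the four slots.
weighted = dispatch_order({"key": 3, "other": 1}, {"key": 10, "other": 10}, 4)
print(equal)
print(weighted)
```

Under plain FCFS all four slots in the first case would go to alice, since her jobs were submitted first.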
Global Fairshare
Global fairshare balances resource usage across the entire cluster according to one single fairshare policy. Resources used in one queue affect job dispatch order in another queue.
To configure global fairshare, you must use host partition fairshare. Use the keyword all to configure a single partition that includes all the hosts in the cluster.
Example
Begin HostPartition
HPART_NAME = GlobalPartition
HOSTS = all
USER_SHARES = [groupA@, 3] [groupB, 7] [default, 1]
End HostPartition

Chargeback Fairshare
Chargeback fairshare lets competing users share the same hardware resources according to a fixed ratio. Each user is entitled to a specified portion of the available resources.
Example
Suppose two departments contributed to the purchase of a large system. The engineering department contributed 70 percent of the cost, and the accounting department 30 percent. Each department wants to get its money's worth from the system.
1. Define 2 user groups in lsb.users, one listing all the engineers, and one listing all the accountants.
Begin UserGroup
Group_Name Group_Member
eng_users (user6 user4)
acct_users (user2 user5)
End UserGroup
2. Configure a host partition for the host, and assign the shares appropriately.
Begin HostPartition
HPART_NAME = big_servers
HOSTS = hostH
USER_SHARES = [eng_users, 7] [acct_users, 3]
End HostPartition

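The arithmetic behind the share values can be checked with a short sketch: each group's entitlement is its shares divided by the sum of all shares in the partition. The helper names `parse_user_shares` and `entitlements` are hypothetical, and the parser only handles the simple `[name, number]` form shown on the slide.

```python
import re


def parse_user_shares(text):
    """Parse a USER_SHARES value such as "[eng_users, 7] [acct_users, 3]"."""
    pairs = re.findall(r"\[\s*([^,\]\s]+)\s*,\s*(\d+)\s*\]", text)
    return {name: int(shares) for name, shares in pairs}


def entitlements(shares):
    """Convert share counts into fractions of the partition's resources."""
    total = sum(shares.values())
    return {name: count / total for name, count in shares.items()}


shares = parse_user_shares("[eng_users, 7] [acct_users, 3]")
fractions = entitlements(shares)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.0%}")
```

Shares of 7 and 3 therefore reproduce the 70 percent and 30 percent split of the purchase cost; only the ratio matters, so 70 and 30 would describe the same policy.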
Priority User Fairshare
Priority user fairshare gives priority to important users, so their jobs override the jobs of other users.
Example
A queue is shared by key users and other users. As long as there are jobs from key users waiting for resources, other users' jobs will not be dispatched.
1. Define a user group called key_users in lsb.users.
2. Configure fairshare and assign the overwhelming majority of shares to the critical users:
Begin Queue
QUEUE_NAME = production
FAIRSHARE = USER_SHARES[[key_users@, 2000] [others, 1]]
...
End Queue

Resources
Boolean resources are custom resources that describe features that may not be available or identical on all machines in a cluster. For example:
- Machines may have different types and versions of operating systems.
- Machines may play different roles in the system, such as file server or compute server.
- Some machines may have special-purpose devices needed by some applications.
- Certain software packages or licenses may be available only on some of the machines.
A shared resource is a custom resource that is not tied to a specific host, but is associated with the entire cluster, or a specific subset of hosts within the cluster. Examples of shared resources include:
- Floating licenses for software packages
- Disk space on a file server which is mounted by several machines
- The physical network connecting the hosts

Resource Use
LSF monitors the resources used by jobs submitted through it while they are running. This information is used to enforce resource limits and load thresholds as well as fairshare scheduling. LSF collects information such as:
- Total CPU time consumed by all processes in the job
- Total resident memory usage in kilobytes of all currently running processes in a job
- Total virtual memory usage in kilobytes of all currently running processes in a job
- Currently active process group ID in a job
- Currently active processes in a job

Load Indices Collected By LIM
Load indices measure the availability of dynamic, non-shared resources on hosts in the LSF cluster. Load indices are numeric in value. Load indices built into the LIM are updated at fixed time intervals.
Viewing Info About Load Indices
% lsload
HOST_NAME  status   r15s  r1m  r15m  ut   pg   ls  it   tmp  swp   mem
hostN      ok       0.0   0.0  0.1   1%   0.0  1   224  43M  67M   3M
hostK      -ok      0.0   0.0  0.0   3%   0.0  3   0    38M  40M   7M
hostG      busy     *6.2  6.9  9.5   85%  1.1  30  0    5M   400M  385M
hostF      busy     0.1   0.1  0.3   7%   *17  6   0    9M   23M   28M
hostV      unavail

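Output in this shape is easy to post-process. The sketch below parses the sample above into one record per host and notes which indices carry a leading asterisk, which in this listing appears on the busy hosts next to the index that is out of range. The function name `parse_lsload` and the `flagged` key are assumptions for illustration; values are kept as strings rather than converted to numbers.

```python
SAMPLE = """\
HOST_NAME status r15s r1m r15m ut pg ls it tmp swp mem
hostN ok 0.0 0.0 0.1 1% 0.0 1 224 43M 67M 3M
hostK -ok 0.0 0.0 0.0 3% 0.0 3 0 38M 40M 7M
hostG busy *6.2 6.9 9.5 85% 1.1 30 0 5M 400M 385M
hostF busy 0.1 0.1 0.3 7% *17 6 0 9M 23M 28M
hostV unavail
"""


def parse_lsload(text):
    """Turn whitespace-separated lsload-style output into a list of dicts."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    hosts = []
    for fields in rows:
        record = {"HOST_NAME": fields[0], "status": fields[1], "flagged": []}
        # Unavailable hosts report no load values, so zip() yields nothing.
        for name, value in zip(header[2:], fields[2:]):
            if value.startswith("*"):
                record["flagged"].append(name)
                value = value[1:]
            record[name] = value
        hosts.append(record)
    return hosts


hosts = {h["HOST_NAME"]: h for h in parse_lsload(SAMPLE)}
for name, info in hosts.items():
    print(name, info["status"], info["flagged"])
```

A monitoring script could use the `flagged` list to report why a host is busy without re-deriving the thresholds.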
Checkpointing Jobs
Checkpointing a job captures the state of an executing job and the data necessary to restart it, so that the work done to reach the current stage is not wasted. The job state information is saved in a checkpoint file. Reasons to checkpoint a job include:
- Fault Tolerance
- Migration
- Load Balancing

Types of Checkpointing
Kernel-Level Checkpointing: Kernel-level checkpointing is provided by the operating system and can be applied to arbitrary jobs running on the system. This approach is transparent to the application: there are no source code changes and no need to re-link your application with checkpoint libraries.
User-Level Checkpointing: For systems that do not support kernel-level checkpointing, LSF provides a method called user-level checkpointing. To implement user-level checkpointing, you must have access to your application's object files (.o files), and they must be re-linked with a set of libraries provided by LSF. This approach is transparent to your application: its code does not have to be changed, and the application does not know that a checkpoint and restart has occurred.
Application-Level Checkpointing: The application-level approach applies to those applications which are specially written to accommodate checkpoint and restart. The application checkpoints itself either periodically or in response to signals sent by other processes. When restarted, the application itself must look for the checkpoint files and restore its state.

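Application-level checkpointing can be sketched in a few lines: the program saves its own state periodically and, when started again, looks for the checkpoint file before doing any work. The example is a generic illustration, not tied to LSF; the file name, the JSON state layout, and the `stop_after` parameter (used here only to simulate an interruption) are assumptions.

```python
import json
import os
import tempfile


def save(state, path):
    """Write the checkpoint atomically so a crash never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(total_steps, ckpt_path, checkpoint_every=10, stop_after=None):
    """Sum 0..total_steps-1, checkpointing every `checkpoint_every` steps.

    Returns the final sum, or None if the simulated interruption fired.
    """
    state = {"step": 0, "acc": 0}
    if os.path.exists(ckpt_path):       # restart: restore the saved state
        with open(ckpt_path) as f:
            state = json.load(f)
    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed == stop_after:
            return None                 # simulate the job being killed
        state["acc"] += state["step"]
        state["step"] += 1
        executed += 1
        if state["step"] % checkpoint_every == 0:
            save(state, ckpt_path)
    return state["acc"]


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "job.ckpt")
    first = run(100, ckpt, stop_after=35)   # interrupted after 35 steps
    second = run(100, ckpt)                 # resumes from the step-30 checkpoint
    print(first, second)
```

Only the five steps executed after the last checkpoint are repeated on restart; the shorter the checkpoint interval, the less work is lost, at the cost of more frequent writes.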
MultiCluster
Resource sharing among separately managed sites:
- Multiple departments/divisions in a large corporation
- Computing center supporting many sites
- Multiple cooperating organizations
Resource sharing among loosely connected sites:
- Over long distance or slow links
- Across WAN with time differences

MultiCluster: Key Requirements
- Autonomy
- Reliability
- Non-shared user accounts and file systems

MultiCluster: Inter-Cluster Batch Job Flow
[Diagram; labels: users, MBD, jobs, status, inter-cluster policy, inter-cluster policy agreement, Master LIM, conf, load info]

MultiCluster: Job Submission and Monitoring

Configuration: MultiCluster
[Diagram; labels: lsf.shared (shared or replicated across clusters), lsf.cluster, contact hosts, other hosts]

Configuration: MultiCluster
Features in MultiCluster:
- Monitoring of load and host information of remote clusters
- Access control of inter-cluster interactive tasks
- Executing batch jobs transparently in remote clusters
- Account mapping between clusters

Configuration: MultiCluster
Enabling MultiCluster feature (step 1):
In the license.dat files in the local and remote clusters: a FEATURE line is needed to enable the LSF MultiCluster feature in each cluster.
FEATURE lsf_multicluster lsf_ld 3.200 1-jan-0000 800 BC53D59BDA04DE12166A "Platform"

Configuration: MultiCluster
Enabling MultiCluster feature (step 2):
In lsf.cluster.cluster-name files: Add the LSF_MultiCluster keyword to the PRODUCTS line of the Parameters section.
Begin Parameters
PRODUCTS = LSF_Base LSF_Batch … LSF_MultiCluster
End Parameters
If the local cluster is interested only in certain remote clusters specified in the lsf.shared file, you can use the RemoteClusters section to limit which remote clusters the local cluster is interested in.
In lsf.cluster.cluster1:
Begin RemoteClusters
CLUSTERNAME  EQUIV  CACHE_INTERVAL  RECV_FROM
cluster2     Y      30              N
End RemoteClusters
In lsf.cluster.cluster2:
Begin RemoteClusters
CLUSTERNAME  EQUIV  CACHE_INTERVAL  RECV_FROM
cluster1     N      45              Y
End RemoteClusters

Configuration: MultiCluster
Enabling MultiCluster feature (step 3):
In lsf.shared files (should be shared or replicated): Configure LSF Base to distribute interactive tasks across clusters. The Cluster section should list the names of all clusters. The LIM reads the lsf.shared file in LSF_CONFDIR for each remote cluster and saves the first 10 host names listed in the Host section (one of them must be the master).
Begin Cluster
ClusterName # keyword
cluster1
cluster2
End Cluster
If the lsf.shared file is not shared or replicated, then it is necessary to specify a list of valid server hosts in each cluster using the Servers option in the Cluster section.
Begin Cluster
ClusterName  Servers # keyword
cluster1     (hostA hostB hostC)
cluster2     (hostD hostE hostF hostG hostH)
End Cluster

Configuration: MultiCluster
Enabling MultiCluster feature (step 4):
In the lsb.queues file: Configure LSF Batch to specify the queues sharing jobs.
Begin Queue
QUEUE_NAME=normal
PRIORITY=30
NICE=20
SNDJOBS_TO=queue2@cluster2 queue3@cluster3 … queueN@clusterN
RCVJOBS_FROM=cluster2 cluster3 … clusterN
End Queue

Configuration: MultiCluster
Enabling MultiCluster feature (step 4): Inter-cluster job flow
lsb.queues file in cluster1:
Begin Queue
QUEUE_NAME=normal
PRIORITY=34
SNDJOBS_TO=normal@cluster2
RCVJOBS_FROM=cluster2
RES_REQ=r1m<0.9
HOSTS=hostA hostB
DESCRIPTION=Multicluster queue
End Queue
lsb.queues file in cluster2:
Begin Queue
QUEUE_NAME=normal
PRIORITY=20
SNDJOBS_TO=normal@cluster1
RCVJOBS_FROM=cluster1
RES_REQ=r1m<0.9
HOSTS=hostC hostD hostE
DESCRIPTION=Multicluster queue
End Queue

Configuration: MultiCluster
Enabling MultiCluster feature (step 5):
User level account mapping (~username/.lsfhosts): Individual users of the LSF cluster can set up their own account mapping by creating a .lsfhosts file in their home directories.
System level account mapping (lsb.users): The LSF administrator can set up system level account mapping in the UserMap section.
For example, to map userA in cluster1 to userB in cluster2:
~userA/.lsfhosts on hosts in cluster1:
cluster2 userB
~userB/.lsfhosts on hosts in cluster2:
cluster1 userA
lsb.users in cluster1:
Begin UserMap
LOCAL  REMOTE                           DIRECTION
userA  userB@cluster2                   export
userC  (userD@cluster2 userE@cluster2)  export
…
End UserMap
lsb.users in cluster2:
Begin UserMap
LOCAL          REMOTE          DIRECTION
userB          userA@cluster1  import
(userD userE)  userC@cluster1  import
…
End UserMap

NQS Local Submission:

Queue Complexes
A queue complex is a set of local batch queues. Each complex has a set of associated attributes, which provide for control of the total number of concurrently running requests in member queues. This, in turn, provides a level of control between queue limits and global limits. The following queue complex limits can be set:
- Group limits
- Memory limits
- Run limits
- User limits
- MPP processing element (PE) limits (CRAY T3D systems), MPP application processing elements (CRAY T3E systems), or number of processors (IRIX systems)
To create a queue complex (a set of batch queues), use the following qmgr command:
create complex = (queuename(s)) complexname
To add or remove queues in an existing complex, use the following qmgr commands:
add queues = (queuename(s)) complexname
remove queues = (queuename(s)) complexname

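How a complex run limit sits between per-queue limits and global limits can be modelled in a few lines. This is a conceptual sketch, not NQS code: the function `can_start`, the queue names `small` and `large`, the complex name `daytime`, and all limit values are invented for the example.

```python
def can_start(queue, running, queue_run_limit, complexes):
    """Decide whether one more request may start running in `queue`.

    running         -- dict: queue -> number of currently running requests
    queue_run_limit -- dict: queue -> per-queue run limit
    complexes       -- dict: complex name -> (set of member queues, run limit)
    """
    # Per-queue limit first.
    if running.get(queue, 0) >= queue_run_limit[queue]:
        return False
    # Then every complex the queue belongs to: the limit applies to the
    # combined number of running requests across all member queues.
    for members, limit in complexes.values():
        if queue in members:
            if sum(running.get(q, 0) for q in members) >= limit:
                return False
    return True


queue_run_limit = {"small": 4, "large": 4}
complexes = {"daytime": ({"small", "large"}, 6)}

# "small" is at its own limit; "large" still fits under both limits.
print(can_start("small", {"small": 4, "large": 1}, queue_run_limit, complexes))
print(can_start("large", {"small": 4, "large": 1}, queue_run_limit, complexes))
# Neither queue is full, but together they have reached the complex limit.
print(can_start("small", {"small": 3, "large": 3}, queue_run_limit, complexes))
```

The third case is the one a complex adds: each queue alone would accept another request, yet the shared limit of six holds it back.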
Qmgr
IMPLEMENTATION
- All Cray Research systems
- DEC AXP systems
- HP 9000 systems
- IBM RISC system/6000 systems
- SGI systems
- SPARC systems
The qmgr command provides entry to the queue manager subsystem, which allows authorized administrators to control requests, queues, and daemons associated with the Network Queuing System (NQS).
Qmgr> ad[d] des[tinations] = (new_des [, new_des...]) pipe_queue [position]
position = first | before old_des | after old_des | last
Adds valid destinations for pipe_queue at a specific position in the existing set.

NLB
The Network Load Balancer (NLB) provides status and control of work scheduling within the group of components in the NQE cluster. Sites can use the NLB to provide policy-based scheduling of work in the cluster. NLB collectors periodically collect data about the current workload on the machine where they run. The data from the collectors is sent to one or more NLB servers, which store the data and make it accessible to the NQE GUI Status and Load functions. The NQE GUI Status and Load functions display the status of all requests in the NQE cluster, together with machine load data.

Request Processing for Client Submission to NQS

Request Processing for Remote Submission