Installing and Running SGE at DESY (Zeuthen)
Wolfgang Friebel, HEPiX Meeting, Berkeley, Oct 15, 2001

Slide 2: Introduction
- Motivations for using a batch system:
  - more effective usage of available computers (e.g. more uniform load)
  - usage of resources 24 hours a day
  - assignment of resources according to policies (who gets how much CPU, and when)
  - quicker execution of tasks (the system knows the most powerful, least loaded nodes)
- Our goal: you tell the batch system a script name and what you need in terms of disk space, memory and CPU time; the batch system guarantees the fastest possible turnaround
- Could even be used to get xterm windows on the least loaded machines for interactive use

Slide 3: Batch Systems Overview
- Condor: targeted at using idle workstations (not used at DESY)
- NQS: public domain and commercial versions, basic functionality; used for APE100 projects
- LoadLeveler: mostly found on IBM machines, used at DESY
- LSF: popular, rich set of features, licensed software, used at DESY
- PBS: public domain and commercial versions, origin NASA, rich set of features, became popular recently, used in H1
- Codine/GRD: batch system similar to LSF in functionality, used in HERA-B and for all farms at DESY Zeuthen
- SGE/SGEEE: Sun Grid Engine (Enterprise Edition), the open source successors of Codine/GRD; became the only batch system at Zeuthen (except for the legacy APE100 batch system)

Slide 4: The Old Batch System Concept
- Each group ran a separate cluster with separate instances of GRD or Codine
- Project priorities within a group were maintained by configuring several queues reflecting the priorities
  - queues were named after priority, e.g. long, medium, short, idle, ...
  - or named according to task, e.g. simulation, production, test, ...
- Individuals had to obey group-dependent rules to submit jobs
- Priorities between different groups were realized by cluster size (CPU power)
- Urgent tasks were handled by asking other groups for permission to temporarily use their cluster
  - administrative overhead to enable accounts on machines
  - users had to adapt their batch jobs to the new environment
- There were always heavily overloaded clusters next to machines with lots of idle CPU cycles

Slide 5: A New Scheme for Batch Processing
- Two factors led us to design a new batch processing scheme:
  - shortcomings of the old system, especially the non-uniform usage pattern
  - the licensing situation: our GRD license ended, and we wanted to move to the open source successor of GRD
- One central batch system for all groups
  - dynamic allocation of resources according to the current needs of the groups
  - more uniform configuration of batch nodes
- Very few queue types
  - basically only two types: a queue for ordinary batch jobs and an idle queue
  - most scheduling decisions are based on other mechanisms (see below)
- Resource requests for jobs determine queuing
  - resource definitions are based on the concept of complexes (explained later)
  - users should request resources if the defaults are not well suited to their jobs
  - bookkeeping of resources within the batch system

Slide 6: The Sun Grid Engine Components
- Queues: contain information on the number of jobs and the job characteristics that are allowed on a given host. Jobs need to fit into a queue to get executed. Queues are bound to specific hosts.
- Resources: features of hosts or queues that are known to SGE. Resource attributes are defined in so-called complexes (global, host, queue and user-defined).
- Projects: contain lists of users (usersets) that are working together. The relative importance compared to other projects may be defined using shares.
- Policies: algorithms that define which jobs are scheduled to which queues and how the priority of running jobs is set. SGEEE knows functional, share-based, urgency-based and override policies.
- Shares: SGEEE can use a pool of tickets to determine the importance of jobs. The pool of tickets owned by a project, job etc. is called its share.
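
These objects can be inspected with the standard qconf command; a few examples as a sketch (the queue name is hypothetical, and the exact option set varies somewhat between SGE versions, so check the qconf man page of the installed release):

    qconf -sql                 # list all queues
    qconf -sq batch.node01     # show the configuration of one queue
    qconf -sprjl               # list all projects
    qconf -sul                 # list all usersets (access lists)
    qconf -sc                  # show the complex (resource attribute) configuration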

Slide 7: Benefits of Using the SGEEE Batch System
- For users:
  - jobs get executed on the most suitable (least loaded, fastest) machine
  - fair scheduling according to the defined sharing policies
  - no one else can overuse the system and provoke system degradation
  - users need no knowledge of the host names where their jobs can run
  - quick access to the load parameters of all managed hosts
- For administrators:
  - one-time allocation of resources to users, projects, groups
  - no manual intervention needed to guarantee policies
  - reconfiguration of the running system (to adapt to a changing usage pattern)
  - easy monitoring of hosts and jobs

Slide 8: Policies for Job Handling within SGEEE
Within SGEEE, tickets are used to distribute the workload.
- User-based functional policy
  - tickets are assigned to projects, users and jobs; more tickets mean higher priority and faster execution (if concurrent jobs are running on a CPU)
- Share-based policy
  - certain fractions of the system resources (shares) can be assigned to projects and users
  - projects and users receive those shares over a configurable moving time window (e.g. CPU usage for a month, based on usage during the past month)
- Deadline policy
  - by redistributing tickets the system can give jobs an increasing weight to meet a certain deadline; can be used by authorized users only
- Override policy
  - sysadmins can give additional tickets to jobs, users or projects to temporarily adjust their relative importance
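
A rough command-line sketch of how a project and its shares might be set up; "higgs" is a hypothetical project name, and the exact qconf options and object attributes should be checked against the man pages of the installed SGEEE version:

    qconf -aprj                              # add a project (opens an editor for its attributes)
    qconf -sstree                            # show the share tree used by the share-based policy
    qconf -mstree                            # modify it; each node carries a number of shares
    qsub -P higgs -l t=1:00:00 job_script    # submit a job accounted against that project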

Slide 9: Classes of Hosts and Users
- Submit host: node that is allowed to submit jobs (qsub) and query their status
- Exec host: node that is allowed to run (and submit) jobs
- Admin host: node from which admin commands may be issued
- Master host: node controlling all SGE activity, collecting status information, keeping access control lists etc.
A given host can have any mixture of the roles above.
- Administrator: user who is allowed to fully control SGE
- Operator: user with admin privileges who is not allowed to change the queue configuration
- Owner: user who is allowed to suspend jobs in queues he owns, or to disable owned queues
- User: can manipulate only his own jobs
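
The roles are assigned with standard qconf commands; a short sketch (host and user names are hypothetical):

    qconf -as lxpub01.ifh.de    # add a submit host
    qconf -ah lxadm01.ifh.de    # add an administrative host
    qconf -ae                   # add an execution host (opens an editor)
    qconf -am alice             # make user alice a manager (administrator)
    qconf -ao bob               # make user bob an operator
    # queue owners are listed in the owner_list attribute of the queue configuration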

Slide 10: The Zeuthen SGEEE Installation
- SGEEE built from source with AFS support
  - another system (SGE with AFS) was built for the HERA-B experiment
- Two separate clusters (no mix of operating systems):
  - 95 Linux nodes in the default SGEEE cell
  - other Linux machines (public login) used as submit hosts
  - 17 HP-UX nodes in the cell "hp"
- A cell is a separate pool of nodes controlled by a master node
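
A user would typically select a cell by sourcing its settings file before submitting; a minimal sketch, assuming the /usr/SGE root directory mentioned later in these slides:

    # work with the HP-UX cell instead of the default one
    . /usr/SGE/hp/common/settings.sh     # sets SGE_ROOT, SGE_CELL=hp, PATH, ...
    qsub -l t=0:30:00 job_script         # the job now goes to the hp cell's qmaster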

Slide 11: The Zeuthen SGEEE Installation (continued)
- In production since 9/2001
- Smooth migration from the old system
  - the two batch systems ran in parallel for a limited time
  - coexistence of the old queue configuration scheme and the new one
- Ongoing tuning of the new system
  - the initial goal was to reestablish the functionality of the old system
  - now step-by-step changes towards a truly homogeneous system
  - initially some projects were bound to subgroups of hosts

Slide 12: Our Queue Concept
- One queue per CPU, with a large time limit and low priority
  - users have to specify at least a CPU time limit (usually much smaller)
  - users can request other resources (memory, disk) differing from the default values
- Optionally a second queue that gets suspended as soon as there are jobs in the first queue (idle queue)
- Interactive use is possible because of the low batch priority
- The relation between jobs, users and projects is respected because of the sharing policies
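
A sketch of how such a queue could look in the queue configuration (edited with qconf -mq); the queue names and values are hypothetical, the attributes are standard SGE queue parameters:

    qname             batch.node01
    slots             1                # one queue (one slot) per CPU
    priority          19               # run batch jobs at low (nice) priority
    h_cpu             240:00:00        # large hard CPU limit; users request less via -l t=...
    subordinate_list  idle.node01      # the idle queue is suspended while jobs run here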

Slide 13: Complexes within SGE
- Complexes are containers for resource definitions
- Resources can be requested by a batch job
  - hard requests need to be fulfilled (e.g. host architecture)
  - soft requests are fulfilled if possible
- The actual value of some resource parameters is known
  - the amount of available main memory or disk space can be used for decisions
  - arbitrary "load sensors" can be written to measure resource parameters
- Resources can be reserved for the current job
  - parameters can be made "consumable": a portion of a requested resource gets subtracted from the currently available value of that resource parameter
- The most important parameters are known to SGEEE
  - parameters like CPU time, virtual free memory etc. are built in already
  - some of them need to be activated in the configuration before they can be used
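
A minimal load sensor sketch in zsh, following the begin/end protocol described in the SGE documentation; the attribute name scratch_free is hypothetical and would have to be defined as a host complex first:

    #!/bin/zsh
    # report the free space on the local scratch disk to the execution daemon
    HOST=$(hostname)
    while read line; do
      [ "$line" = quit ] && exit 0     # the execd asks the sensor to terminate
      echo begin
      echo "$HOST:scratch_free:$(df -k /usr1/tmp | awk 'NR==2 {print $4}')"
      echo end
    done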

Slide 14: Our Complexes Concept
- Users have to specify for a job:
  - a time limit (CPU time)
- Users can request for a job:
  - a certain amount of virtual and real free memory
  - the existence of one or two scratch disks
  - coming soon: the available free disk space on a given scratch disk
  - coming soon: a guaranteed amount of disk space reserved for the job
- More hardware-oriented features, such as:
  - using only machines from a subcluster (farm)
  - running on a specific host (not recommended)
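
A few qsub request examples as a sketch; the complex names t and datadir are Zeuthen-specific (they appear elsewhere in these slides), while vf and hostname are standard SGE attribute shortcuts whose exact form may depend on the version:

    qsub -l t=2:00:00 job_script                     # mandatory CPU time limit
    qsub -l t=8:00:00 -l vf=500M job_script          # also request 500 MB virtual free memory
    qsub -l t=4:00:00 -l datadir job_script          # require a node with a /data scratch disk
    qsub -l t=1:00:00 -l hostname=node42 job_script  # run on a specific host (not recommended)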

Slide 15: Experiences
- The system is easily usable from a user's point of view
- The system is highly configurable (it takes some time to find the optimum policies to implement)
- The system is very stable
  - crashing jobs are mostly due to failing token renewal (our plugin procedure based on arc and batchtkauth)
  - other failures are due to (deliberately) missing path aliases for the automounter
- The system dynamically adapts process priorities to meet share policies or to keep up with changing policies
- The SGE(EE) maintainers are very active and keep implementing new ideas
  - quick incorporation of patches; reported bugs get fixed as soon as possible

Slide 16: Advanced Use of SGEEE
- Using the Perl API
  - every aspect of the batch system is accessible through the Perl API
  - the Perl API becomes accessible after "use SGE;" in Perl scripts
  - there is almost no documentation, but a few sample scripts in /afs/ifh.de/user/f/friebel/public and in /afs/ifh.de/products/source/gridengine/source/experimental/perlgui
- Using the load information reported by SGEEE
  - each host reports a number of load values to the master host (qmaster)
  - there is a default set of load parameters that are always reported
  - further parameters can be reported by writing load sensors
  - qhost is a simple interface to display that information
  - a powerful monitoring system could be built around that feature, which is based on the built-in "Performance Data Collection" (PDC) subsystem
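
Typical ways to look at the reported load values with the standard qhost command (the architecture string used in the last example depends on the SGE version):

    qhost                       # one line per host: arch, CPUs, load, memory
    qhost -F                    # full listing of all load and resource values per host
    qhost -l arch=glinux        # restrict the listing to hosts matching a resource request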

Slide 17: Conclusions
- Ease of installation from source
- Access to the source code
- Chance of integration into a monitoring system
- API for C and Perl
- Excellent load balancing mechanisms
- Managing the requests of concurrent groups
- Mechanisms for recovery from machine crashes
- Fallback solutions for dying daemons
- Weakest point: the AFS integration and the token prolongation mechanism (basically the same code as for LoadLeveler and older LSF versions)

Slide 18: Conclusions (continued)
- SGEEE has all the ingredients to build a company-wide batch infrastructure
  - allocation of resources according to policies ranging from departmental policies to individual user policies
  - dynamic adjustment of priorities for running jobs to meet policies
  - supports interactive jobs, array jobs, parallel jobs
  - can be used with Kerberos (4 and 5) and AFS
- SGEEE is open source, maintained by Sun
  - deeper knowledge can be gained by studying the code
  - the code can be enhanced (examples: more schedulers, tighter AFS integration, monitoring-only daemons)
  - the code is centrally maintained by a core developer team
  - could play a more important role in HEP (component of a grid environment; an open, industry-grade batch system as the recommended solution within HEPiX?)

Slide 19: References
- Download page for the source code of SGE(EE)
- Lots of docs from Raytheon
- Support forum, mailing lists
- S/FERSTL/INDEX.HTM: GRD at a conference in 1998
- Zeuthen pages with a URL to the reference manual
- The SGEEE reference manual, user and installation guide

Slide 20: Technical Details of SGEEE (not presented)
- Submitting jobs
- The graphical interface qmon
- Job submission and file systems
- Sample job script
- Advanced usage of qsub
- Abnormal job termination

Slide 21: Submitting Jobs
- Requirements for submitting jobs:
  - have a valid token (verify with tokens); otherwise obtain a new one (klog)
  - ensure that your .[t]cshrc or .zshrc executes no commands that need a terminal (tty); users often have a stty command in their startup scripts
  - you are within batch if the environment variable JOB_NAME is set, or if the environment variable ENVIRONMENT is set to BATCH
- Submitting a job:
  - specify what resources you need (-l option) and what script should be executed: qsub -l t=1:00:00 job_script
  - in the simplest case the job script contains one line: the name of the executable
  - many more options are available
  - alternatively, use the graphical interface to submit jobs: qmon &
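
A minimal sketch of such a guard in ~/.zshrc, based on the JOB_NAME / ENVIRONMENT variables mentioned above (the stty line stands for any terminal-only command):

    # skip terminal-only commands when running as a batch job
    if [ -z "$JOB_NAME" ] && [ "$ENVIRONMENT" != BATCH ]; then
      stty erase '^?'          # only makes sense on an interactive terminal
    fi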

Slide 22: The Submit Window of qmon (screenshot)

Slide 23: Job Submission and File Systems
- Current working directory
  - the current working directory is the directory from which the qsub command was called; by default, STDOUT and STDERR of a job go into files created in $HOME, which is not recommended because of quota limits and archiving policies
  - with the -cwd option to qsub the files get created in the current working directory; for performance reasons that should be a local file system
  - if the cwd is in NFS space, the batch system must not use the real mount point; the path is translated according to /usr/SGE/default/common/sge_aliases. Since every job stores the full contents of sge_aliases, it is advantageous to get rid of that file and to discourage the use of NFS directories as the current working directory
  - if required, create your own $HOME/.sge_aliases file
- Local file space (Zeuthen policies)
  - /usr1/tmp is guaranteed to exist on all Linux nodes and typically has > 10 GB
  - /data exists on some Linux nodes and typically has > 15 GB capacity; a job can request the existence of /data with -l datadir
  - $TMP[DIR] is a unique directory below /usr1/tmp that gets erased at the end of the job; normal jobs should use this mechanism if possible
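
A sketch of a job script fragment using that per-job scratch directory (the data server path, input file and executable names are taken from the sample script on the next slide):

    # work in the unique per-job directory below /usr1/tmp; it is removed automatically
    cd $TMPDIR
    cp /net/ilos/h1data7/large_input .     # stage the input onto the local disk
    h1_reco                                # run the application on local data
    cp large_out /net/ilos/h1data7/        # copy results back before the job ends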

Slide 24: A Simple Job Script

    #!/bin/zsh
    #$ -S /bin/zsh              # otherwise the default shell would be used
    #
    #$ -l t=0:30:00             # the time limit for this job
    #$ -j y
    WORKDIR=/usr1/tmp/$LOGNAME/$JOB_ID
    DATADIR=/net/ilos/h1data7
    echo using working directory $WORKDIR
    mkdir -p $WORKDIR
    cp $DATADIR/large_input $WORKDIR
    cd $WORKDIR
    h1_reco
    cp large_out $DATADIR
    # clean up only if the output arrived intact in $DATADIR
    if cmp -s large_out $DATADIR/large_out; then
      cd
      rm -r $WORKDIR
    fi

Slide 25: Advanced Usage of qsub
- Option files
  - instead of giving qsub options on the command line, users may store them in .sge_projects files in their $HOME or current working directories
  - content of a sample .sge_projects file: -cwd -S /usr/local/bin/perl -j y -l t=24:00:00
- Array jobs
  - SGE allows scheduling n identical jobs with one qsub call using the -t option: qsub -t 1-10 array_job_script
  - within the script, use the variable SGE_TASK_ID to select different inputs and write to distinct output files (SGE_TASK_ID runs from 1 to 10 in the example above)
- Conditional job execution
  - jobs can be scheduled to wait for dependent jobs to finish successfully (rc=0)
  - jobs can be submitted in hold state (to be released by the user or an operator)
  - jobs can be told not to start before a given date
  - dependent jobs can be started on the same host (using qalter -q $QUEUE ... within the script)
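
A minimal array job script sketch matching the qsub -t 1-10 example above (the input/output naming and my_program are hypothetical):

    #!/bin/zsh
    #$ -S /bin/zsh
    #$ -l t=1:00:00
    #$ -t 1-10
    # each task of the array job processes its own input and output file
    INPUT=input.$SGE_TASK_ID
    OUTPUT=output.$SGE_TASK_ID
    my_program < $INPUT > $OUTPUT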

Slide 26: Abnormal Job Termination
- Termination because the CPU limit was exceeded
  - jobs get an XCPU signal that can be caught by the job; termination procedures can then be executed before the SIGKILL signal is sent
  - SIGKILL will be sent a few minutes after XCPU; it cannot be caught
- Restart after an execution host crash
  - if a host crashes while a job is running, the job will be restarted; in that case the variable RESTARTED is set to 1
  - the job will be re-executed from the beginning on any free host; if the job can be restarted using results achieved so far, the variable RESTARTED can be checked
  - the job can be forced to run on the same host by inserting qalter -q $QUEUE $JOB_ID literally in the job script
- Signaling the end of the job
  - with the qsub option -notify, a SIGUSR1 signal is sent to the job a few minutes before the job is suspended or terminated
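
A sketch of how a job script could react to these signals; save_partial_results is a hypothetical function, and the trapped signals correspond to the XCPU and -notify mechanisms described above:

    # save intermediate output when the CPU limit or a kill is imminent
    save_partial_results() {
      cp partial_out $DATADIR      # $DATADIR as in the sample job script
    }
    trap 'save_partial_results; exit 1' XCPU USR1
    if [ "$RESTARTED" = 1 ]; then
      echo "restarted after a host crash, trying to reuse earlier results"
    fi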