Adding High Availability to Condor Central Manager – Tutorial
Gabi Kliot, Computer Sciences Department, Technion – Israel Institute of Technology


2 Introduction to HA
› Multiple Collectors run simultaneously, one on each CM machine
› All submission and execution machines must be configured to report to all CMs
› HAD – the HA daemon – runs on each CM
› HAD makes sure a single Negotiator runs on one of the CMs

3 Basic Scenario
[Diagram: the active CM runs a Collector, the leader HAD, and the Negotiator; each idle CM runs only a Collector and a HAD; the leader HAD sends "I'm alive" messages to the other HADs; workstations running Startd and Schedd report to all Collectors]

4 › The HA mechanism must be explicitly enabled

5 HAD_LIST
› List of server machines where the HA daemons (HAD) will be installed, configured, and run
› Each element in the list is an IP address or hostname and a port number, separated by a colon; elements are separated from each other by commas
› HAD_LIST should be identical on all CM machines
› HAD_LIST should be identical (ports excluded) to the COLLECTOR_HOST list, and in the same order
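For illustration, a two-CM pool could use (the hostnames match the later slides; the port number is an arbitrary choice for this sketch, not a value prescribed by the tutorial):

  HAD_LIST = cm1.wisc.edu:51450, cm2.wisc.edu:51450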

6 HAD_USE_PRIMARY
› One HAD can be declared as primary
› The primary HAD is always guaranteed to be elected as the active CM, as long as it is alive
› After the primary recovers, it will become the active CM again, substituting one of its backups
› If HAD_USE_PRIMARY = true, the first element in HAD_LIST is the primary HAD; the rest of the daemons serve as backups
› Default is false
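Continuing the sketch above, with

  HAD_USE_PRIMARY = true

cm1.wisc.edu, the first element of HAD_LIST, would be the primary CM, and cm2.wisc.edu its backup.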

7 HAD_CONNECTION_TIMEOUT
› An upper bound on the time (in seconds) it takes HAD to establish a TCP connection
› Recommended value is 2 seconds
› Default is 5 seconds
› Affects stabilization time – the time it takes the HA daemons to detect a failure and fix it
› Stabilization time = 12 * #CMs * HAD_CONNECTION_TIMEOUT
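For example, a pool with 2 CM machines and the recommended 2-second timeout stabilizes within 12 * 2 * 2 = 48 seconds of a failure.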

8 HAD_ARGS
› HAD_ARGS = -p $(HAD_PORT)
› HAD_PORT should be identical to the port defined in HAD_LIST for that host
› Allows the master to start HAD on a specified command port
› No default value – this one is mandatory

9 Regular daemon configuration
› HAD – path to the condor_had binary
› HAD_LOG – path to the log file
› MAX_HAD_LOG – maximum size of the log file
› HAD_DEBUG – logging level for condor_had
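A minimal sketch of these four settings (the paths and debug level follow the later slides; the size limit is an illustrative value, not one the tutorial prescribes):

  HAD = $(SBIN)/condor_had
  HAD_LOG = $(LOG)/HADLog
  MAX_HAD_LOG = 640000
  HAD_DEBUG = D_COMMAND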

10 Influenced configuration variables
› On both client (Schedd + Startd) and CM machines:
  • COLLECTOR_HOST – list of CM machines
  • HOSTALLOW_NEGOTIATOR – must include all CM machines

11 Influenced configuration variables
› Only on Schedd machines:
  • HOSTALLOW_NEGOTIATOR_SCHEDD – must include all CM machines
› Only on CM machines:
  • DAEMON_LIST – must include Collector, Negotiator, HAD
  • DC_DAEMON_LIST – must include Collector, Negotiator, HAD
  • HOSTALLOW_ADMINISTRATOR – CM machines must have administrative privileges (in order to turn the Negotiator on and off)

12 Configuration Files

13 Deprecated variables
› # unset these variables – they are deprecated
› NEGOTIATOR_HOST =
› CONDOR_HOST =

14 condor_config.local.ha_central_manager
› CENTRAL_MANAGER1 = cm1.wisc.edu
› CENTRAL_MANAGER2 = cm2.wisc.edu
› COLLECTOR_HOST = $(CENTRAL_MANAGER1),$(CENTRAL_MANAGER2)

15 condor_config.local.ha_central_manager (continued)
› HAD_PORT =
› HAD_LIST = $(CENTRAL_MANAGER1):$(HAD_PORT), $(CENTRAL_MANAGER2):$(HAD_PORT)
› HAD_ARGS = -p $(HAD_PORT)
› HAD_CONNECTION_TIMEOUT = 2
› HAD_USE_PRIMARY = true
› HAD = $(SBIN)/condor_had
› MAX_HAD_LOG =
› HAD_DEBUG = D_COMMAND
› HAD_LOG = $(LOG)/HADLog

16 condor_config.local.ha_central_manager (continued)
› DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD
› DC_DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD
› HOSTALLOW_NEGOTIATOR = $(COLLECTOR_HOST)
› HOSTALLOW_ADMINISTRATOR = $(COLLECTOR_HOST)

17 condor_config.local.ha_client
› CENTRAL_MANAGER1 = cm1.wisc.edu
› CENTRAL_MANAGER2 = cm2.wisc.edu
› COLLECTOR_HOST = $(CENTRAL_MANAGER1),$(CENTRAL_MANAGER2)
› HOSTALLOW_NEGOTIATOR = $(COLLECTOR_HOST)
› HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST)

18 Disabling the HA mechanism
› Remove HAD and NEGOTIATOR from DAEMON_LIST on all machines
› Leave one NEGOTIATOR in DAEMON_LIST on one machine
› condor_restart the CM machines
› Or turn off the running HA mechanism (see the command sketch below):
  • condor_off -all -negotiator
  • condor_off -all -had
  • condor_on -negotiator on one machine
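A minimal command sketch of the run-time route, assuming cm1.wisc.edu (a hostname from the earlier slides) is the machine chosen to keep the Negotiator:

  # stop every Negotiator and HAD in the pool
  condor_off -all -negotiator
  condor_off -all -had
  # bring a single Negotiator back on the chosen CM only
  condor_on -negotiator -name cm1.wisc.edu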

19 Configuration Sanity Check script
› Checks that all HA-related configuration parameters of a RUNNING pool are correct:
  • HAD_LIST consistent on all CMs
  • HAD_CONNECTION_TIMEOUT consistent on all CMs
  • COLLECTOR_HOST consistent on all machines and corresponds to HAD_LIST
  • DAEMON_LIST contains HAD, COLLECTOR, NEGOTIATOR
  • HAD_ARGS is consistent with HAD_LIST
  • HOSTALLOW_NEGOTIATOR and HOSTALLOW_ADMINISTRATOR are set correctly
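The script itself is not reproduced in these slides, but the same checks can be approximated by hand with condor_config_val. A rough sketch, assuming the two CM hostnames from the earlier slides and that remote configuration queries are permitted in the pool:

  #!/bin/sh
  # Ask each CM's master for its HA-related settings and compare
  # the outputs by eye; any difference indicates a misconfiguration.
  for cm in cm1.wisc.edu cm2.wisc.edu; do
      echo "== $cm =="
      for var in HAD_LIST HAD_CONNECTION_TIMEOUT COLLECTOR_HOST \
                 DAEMON_LIST HAD_ARGS; do
          echo "$var = `condor_config_val -name $cm -master $var`"
      done
  done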

20 Backward Compatibility
› Non-upgraded client machines will run fine as long as the machine that served as Central Manager before the upgrade is configured as the primary CM
› Non-upgraded client machines will, of course, not benefit from CM failover

21 FAQ
› Reconfigure and restart all your pool nodes, not only the CMs
› Run the sanity check script
› condor_off -neg actively shuts down the Negotiator; no HA is provided for it afterwards
› If the primary CM has failed, tools take longer to return results, since they query the Collectors in the order of COLLECTOR_HOST
› More than one Negotiator may be observed for a very short time at startup
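To check which Negotiator is currently active, the pool's negotiator ads can be queried (a standard condor_status option, not specific to this tutorial):

  condor_status -negotiator

Right after a failover or startup, this may briefly show more than one Negotiator, as noted above.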