Dynamic Deployment of VO Specific Condor Scheduler using GT4

Presentation transcript:

Dynamic Deployment of VO Specific Condor Scheduler using GT4
Gaurang Mehta - gmehta@isi.edu
Center for Grid Technologies, Information Sciences Institute / USC
Condor Week 2006

Outline
- Introduction
- VO Problem
- Solution 1: Glideins
- Solution 2: Glideins via GCB
- Solution 3: Condor Brick
- Conclusion

Grid Job Submission
[Diagram: a submit node (Collector, Negotiator, Master, Schedd) sends GRAM jobs to each cluster's GT4 PBS GRAM service, which turns them into PBS jobs on that cluster's worker nodes.]

Introduction
- GT4 GRAM allows remote job submission to a cluster, but has no scheduling capabilities of its own.
- Condor-G can act as a meta-scheduler to submit to different clusters via GRAM (see the sketch below).
- The job description is translated multiple times before the job is executed.
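For concreteness, a minimal sketch of the kind of Condor-G (grid universe) submit file this setup implies; the GRAM host, executable, and file names are placeholders, and the gt4 grid_resource syntax should be checked against the Condor manual of that era.

    # Hypothetical Condor-G submit file: run one job on a remote cluster
    # through its GT4 GRAM service, which hands it to PBS.
    universe      = grid
    grid_resource = gt4 https://headnode.example.edu:8443 PBS
    executable    = analyze_quake
    arguments     = event-0001.cfg
    transfer_executable = true
    output        = quake.$(Cluster).out
    error         = quake.$(Cluster).err
    log           = quake.log
    queue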

VO Requirements
The Southern California Earthquake Center (SCEC):
- Gathers new information about earthquakes in Southern California.
- Integrates this information into a comprehensive and predictive understanding of earthquake phenomena.
- Wants to run millions of earthquake-analysis jobs simultaneously on tens of resources (thousands of CPUs).
- Wants reliable performance across all of its jobs.
- Wants to hold the allocated resources for a period of time and pipe as many jobs to them as it can.
- Wants to run on a wide range of resource configurations.

Solution 1: Glideins
[Diagram: the submit node (Collector, Master, Negotiator, Schedd) sends glidein requests to a cluster's GT4 PBS GRAM; PBS runs the glideins on the worker nodes, which connect back to the collector and execute jobs. The cluster is on a public network.]

Solution 1: Glideins
- Glideins add resources to an existing Condor setup by running condor_startd daemons, via GT4 GRAM, on a remote scheduler and cluster.
- Glideins can be set up and started either by running the condor_glidein command or by writing your own Condor submit file or GRAM RSL (a sketch of the do-it-yourself route follows this list).
Pros
- Allows your own personal Condor cluster (VO grid) to be created.
- Allows VO-specific policies and priorities to be applied to all the nodes in the grid.
Cons
- Glideins are not usable (without additional help) when the remote resources have private IP addresses.
- The submit node becomes a bottleneck for the number of jobs that can be submitted.
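A rough sketch of the hand-rolled route the slide mentions: a grid-universe job that launches the glidein on the remote cluster. The startup script name, its arguments, and the GRAM contact are placeholders; condor_glidein automates essentially this.

    # Hypothetical hand-rolled glidein: submit the Condor execute daemons to
    # the remote cluster's GT4 PBS GRAM as ordinary grid-universe jobs.
    universe      = grid
    grid_resource = gt4 https://headnode.example.edu:8443 PBS
    # Placeholder script that unpacks the Condor binaries and runs condor_master
    # (and thus condor_startd) pointed at the VO's collector.
    executable    = glidein_startup.sh
    arguments     = -collector submit.vo.example.org:9618
    transfer_executable = true
    output        = glidein.$(Cluster).$(Process).out
    error         = glidein.$(Cluster).$(Process).err
    log           = glidein.log
    queue 20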

Solution 2: Glideins via GCB
[Diagram: the same glidein setup, but the worker nodes sit on a private network (outgoing connections allowed); the glideins cannot connect back to the collector on the public network, so jobs cannot execute.]

Solution 2: Glideins with GCB
- Sometimes clusters are behind a firewall and have only private IP addresses, while still allowing outgoing connections from each node.
- In such a case glideins cannot work directly: the shadow process is unable to communicate directly with the starter process on the remote node.
- Generic Connection Brokering (GCB) sits somewhere on the public network and acts as a proxy/relay (a configuration sketch follows this list).
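A sketch of the GCB-related settings a glidein's condor_config would carry so that its daemons relay through the broker. The knob names are recalled from the Condor 6.x GCB documentation and the broker address is a placeholder; treat this as illustrative rather than definitive.

    # Illustrative GCB settings for the glidein-side condor_config
    # (knob names per the Condor 6.x GCB docs; address is a placeholder).
    NET_REMAP_ENABLE    = TRUE
    NET_REMAP_SERVICE   = GCB
    # Public address of the GCB broker the private-network daemons use:
    NET_REMAP_INAGENT   = 192.0.2.10
    BIND_ALL_INTERFACES = TRUE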

Solution 2: Glideins via GCB
[Diagram: a GCB broker sits on the public network; the glideins on the private-network worker nodes open connections to the GCB, through which they register with the collector and execute jobs.]

Solution 2: Glideins with GCB
Pros
- GCB allows the shadow and starter processes to communicate across network boundaries.
- Allows a VO-specific grid to be created.
- Allows VO-specific policies and priorities to be applied to the entire VO grid.
Cons
- Additional overhead of maintaining and running a GCB proxy, which is a single point of failure for all the jobs.
- Only works if the remote cluster setup allows outgoing connections to the public network.
- The submit node is a bottleneck for the number of jobs that can be submitted.

Solution 3: Condor Brick
- Some clusters with private IP addresses allow neither incoming nor outgoing connections, except to the boundary servers.
- In such cases it is not possible to run glideins either directly or via GCB.
- Neither of the earlier solutions allows per-cluster/site VO policies or priorities to be applied (it may be possible, but I don't know of a way).
- Enter the Condor Brick.

Solution 3: Condor Brick
[Diagram: the glidein-plus-GCB setup on a cluster whose private-network worker nodes cannot connect out; the glideins cannot reach the collector even through the GCB.]

Condor Brick
- Condor Master, Collector, Negotiator, and Schedd.
- VO-specific policies and priorities (Condor config).
- Straddles the private and public networks.
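A minimal sketch of what the brick's condor_config might contain, per the daemon list on this slide; the policy settings are placeholders, and DAEMON_LIST and BIND_ALL_INTERFACES are standard Condor configuration knobs.

    # Illustrative condor_config for a Condor Brick on a boundary host.
    DAEMON_LIST         = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD
    # Listen on both the public and the private interface of the boundary host.
    BIND_ALL_INTERFACES = TRUE
    # VO-specific policy and priority settings would also live here.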

Solution 3: Condor Brick
- A bundle of Condor daemons and configuration files.
- Condor Bricks bind to all the interfaces on the remote server, so a brick sits on the boundary and talks to both the public and private networks.
- Dynamically deploy the brick on remote clusters, when needed, using a GT4 GRAM Fork job manager on any boundary machine.
- Dynamically glide in the required cluster nodes to the brick using a GT4 GRAM job manager for the cluster's local scheduler (schedd-startd communication stays on the LAN).
- Use Condor-C to submit jobs to the remote Condor Bricks (schedd-schedd communication over the WAN); a submit-file sketch follows this list.
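A minimal sketch of the Condor-C submission from the VO submit node to a brick, assuming the brick's schedd and collector run on the same boundary host; the host name and job details are placeholders, and the condor grid_resource syntax should be checked against the Condor manual.

    # Hypothetical Condor-C submit file: forward the job from the VO submit
    # node's schedd to the schedd running inside a remote Condor Brick.
    universe      = grid
    # grid_resource = condor <remote schedd name> <remote collector/pool>
    grid_resource = condor brick.cluster.example.edu brick.cluster.example.edu
    executable    = analyze_quake
    arguments     = event-0001.cfg
    output        = quake.$(Cluster).out
    error         = quake.$(Cluster).err
    log           = quake.log
    queue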

Solution 3: Condor Brick
[Diagram: the submit node deploys the Condor Brick on a boundary machine via a GT4 Fork GRAM and sends glidein requests via the GT4 PBS GRAM; PBS runs the glideins on the worker nodes, which connect to the brick over the private network, and the submit node executes jobs on them via Condor-C. The cluster is on a private network with no incoming or outgoing connections allowed.]

Solution 3: Condor Brick
- A Condor Brick at each cluster ensures the job load is distributed off the submit node.
- A Condor Brick at each cluster enables VO-specific policies and priorities to be implemented at a per-cluster/site level.
- Uniform scheduling system from the submit node down to the cluster.
- High throughput of jobs, since the grid overhead on each job submission is reduced.
- Condor Bricks bind on all interfaces, making this the most generic of the three solutions.

Conclusion
- Deploying a dynamic Condor scheduler for the VO on each cluster results in a robust, uniform scheduling environment.
- A VO can control the policies and priorities for all its members on the entire VO grid, or on each cluster in the VO grid.
- The Condor Brick eliminates some of the grid latency.
- The Condor Brick allows using clusters with different network and firewall configurations.

Questions?
Thanks to Miron and the Condor team.