Condor: High-throughput Computing From Clusters to Grid Computing P. Kacsuk – M. Livny MTA SZTAKI – Univ. of Wisconsin-Madison

Progress in Grid Systems (diagram): the evolution from client/server, cluster computing, supercomputing and network computing, through high-throughput and high-performance computing and Web Services (Condor, Globus), to OGSA/OGSI and OGSA/WSRF based Grid Systems.

Goals of Condor: Grid technology should turn every “ordinary” user into a “supercomputing” user. Computational Grids should give us access to resources anywhere. Question: how can we make them usable by anyone?

Main features of Condor
› A special resource management (batch) system
› Distributed, heterogeneous system
› Goal: exploitation of spare computing cycles
› It can migrate sequential jobs from one machine to another
› The ClassAds mechanism is used to match resource requirements with resources

The Condor model (diagram): the resource requestor sends its resource requirements as ClassAds to the match-maker over TCP/IP, while resource providers publish their configuration descriptions; after a match is found, the client program moves to the matched resource(s). Security is a serious problem!

ClassAds
› Resources of the Grid have different properties (architecture, OS, performance, etc.), and these are described as advertisements (ClassAds)
› When creating a job, we can describe our requirements (and preferences) for these properties
› Condor tries to match the requirements against the ClassAds to provide the most suitable resources for our jobs (a sketch of both sides follows below)
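To make this concrete, here is a rough sketch of a machine-side advertisement and the corresponding job-side expressions; the attribute names (Arch, OpSys, Memory, Mips) are standard Condor ClassAd attributes, but the host name and values are made up for illustration, and the # lines are annotations rather than ClassAd syntax.

# machine ClassAd published by an execution host
MyType = "Machine"
Machine = "node01.example.org"
Arch = "INTEL"
OpSys = "LINUX"
Memory = 512
Mips = 1500

# job ClassAd derived from the user's submit description file
MyType = "Job"
Requirements = (Arch == "INTEL") && (OpSys == "LINUX")
Rank = Memory + Mips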

Requirements and Ranks
Requirements = Arch == "SUN4u" && OpSys == "SOLARIS251"
Exact matching is needed for the Requirements expression.
Rank = Memory + Mips
If there is a choice, Condor will choose the resource with the higher Rank value: here the machine with more memory, and among machines with equal memory, the faster one.

The two sides to match (1)
User side: the submit description file
Requirements = Arch == "INTEL" && OpSys == "LINUX"
Rank = Memory + Disk + Mips

The two sides to match (2)
Resource side: the configuration file (owners of resources may place constraints and preferences on their machines)
Friend = Owner == "haver"
Trusted = Owner != "judas"
Mygroup = Owner == "zoli" || Owner == "jani"
Requirements = Trusted && (Mygroup || LoadAvg < 0.3 && KeyboardIdle > 10*60)
Rank = Friend + MyGroup

Condor Universes
› Standard
› Vanilla
› PVM
› MPI
› Globus

Standard universe
› Checkpointing and automatic migration for sequential jobs
› The existing program has to be re-linked with the Condor instrumentation library (see the sketch below)
› The application cannot use some system calls (fork, socket, alarm, mmap)
› File operations are intercepted and passed back to the shadow process on the submit machine
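As a minimal sketch of how such a job is typically prepared (the program name myprog and its source file are made-up placeholders): the existing program is re-linked with condor_compile and then described in a standard universe submit file.

# re-link the existing program with the Condor library
condor_compile gcc -o myprog myprog.c

# submit description file for the re-linked binary
universe = standard
executable = myprog
output = myprog.out
error = myprog.err
log = myprog.log
queue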

Vanilla universe
› No checkpointing, no migration
› The existing executable can be used without re-compiling or re-linking
› There are no restrictions on system calls
› A shared file system (NFS or AFS) is needed

PVM universe
› To run MW (Master/Worker) PVM programs
› PVM extensions for task management
› Dynamic Virtual Machine creation
› Support for heterogeneous environments
› The user can define checkpoints in the master process
› Worker processes are not checkpointed

MPI universe
› MPICH programs can be used without any changes
› Dynamic changes are not supported
› No checkpointing
› The application cannot be suspended
› A shared file system (NFS or AFS) is needed
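For illustration only, a sketch of what an MPI universe submit file of that era looked like; the executable name and the node count are assumptions.

universe = MPI
executable = mpi_solver
machine_count = 4   # request four nodes for the MPICH job
log = mpi_solver.log
queue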

A simple job description
universe = vanilla
executable = mathematica
input = in$(Process).dat
output = out$(Process).dat
queue 5
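To show how this description is used, a short sketch with the standard Condor commands; the file name job.submit is an assumption. The $(Process) macro expands to 0..4 for the five queued jobs, so each job gets its own input and output file.

condor_submit job.submit   # place the five jobs in the local queue
condor_q                   # watch the state of your jobs
condor_status              # list the machines available in the pool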

Turn your workstation into a Personal Condor
› Install the Condor software on your workstation:
  - submit
  - execute
  - scheduling services
› Submit your application to your Condor system, for example, as a “Standard” universe job

Your Personal Condor...
› ... provides reliable and recoverable job and resource management services
› ... keeps an eye on your jobs and keeps you posted on their progress
› ... implements your policy on
  - when the jobs can run on your workstation
  - the execution order of the jobs
› ... adds fault tolerance to your jobs
› ... keeps a log of your job activities

(Diagram: your workstation running a personal Condor that manages your jobs.)

Build a personal Condor Pool
› Install Condor on the desktop machine next door
› Install Condor on the machines in the classroom
› Configure these machines to be part of your Condor pool (a configuration sketch follows below)
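As a rough sketch of that last step (the host names are made up, and the exact macro set depends on the Condor version): pool membership is controlled by a few entries in each machine's Condor configuration file.

# on every member machine: point to the pool's central manager
CONDOR_HOST = mypc.example.org

# on the central manager (your workstation)
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD

# on an execute-only machine (e.g. a classroom PC)
DAEMON_LIST = MASTER, STARTD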

(Diagram: your workstation's personal Condor, with your jobs, now joined to a Group Condor pool.)

Architecture of a Condor pool (diagram): a Central Manager, a Submit Machine and several Execution Machines.

Components of Your Condor Pool
› On a particular host of the pool:
  - Central manager: the Collector (information collection) and the Negotiator (matchmaker)
› On each host of the pool:
  - Controlling daemons: on a submit machine the submit daemon (schedd), on an execution machine the startd daemon
  - Checkpoint server (on the submit machine)

How does Your Condor Pool work?
› You can submit a Condor job on any host of your pool (the submit machine)
› The submitted job registers at the central manager with its ClassAds (requirements)
› The CM performs matchmaking and selects an execution host
› The matchmaker notifies the two hosts of their compatibility with respect to the particular job
› The job is transferred to the execution host
› A shadow process is created at the submit machine
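While this sequence runs, the standard Condor tools let you observe it; the commands are real, but the job id 12.0 is an invented placeholder.

condor_status            # machine ClassAds known to the collector
condor_q                 # jobs in the local schedd's queue
condor_q -analyze 12.0   # why a given job has (or has not) been matched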

Components of a Condor pool (diagram)

Take advantage of your friends
› Get permission from “friendly” Condor pools to access their resources
› Configure your personal Condor to “flock” to these pools (see the sketch below):
  - The central managers of additional remote Condor pools can be specified as a configuration parameter of the schedd
  - When the local pool cannot satisfy all of its job requests, the schedd will try these remote pools in turn
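A minimal sketch of such a flocking configuration, with made-up host names; FLOCK_TO is the schedd-side list of remote central managers, and the friendly pools must also grant access on their side (e.g. via FLOCK_FROM).

# on your submit machine: remote pools to try when the local pool is exhausted
FLOCK_TO = cm.friendpool1.example.org, cm.friendpool2.example.org

# on the friendly pools: submit machines allowed to flock in
FLOCK_FROM = mypc.example.org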

(Diagram: your workstation's personal Condor and Group Condor, with your jobs, flocking to a friendly Condor pool.) Your schedd daemons see the CM of the other pool as if it were part of your pool.

Condor-glide-in
Enable an application to dynamically turn allocated Grid resources into members of your Condor pool, even if there is no Condor system on those resources.
› Easy to use on different platforms
› Robust
› Supports SMPs
› Provides uniformity

Exploit Grid resources
› Get access (account(s) + certificate(s)) to a “Computational” Grid
› Submit “Grid Universe” Condor-glide-in jobs to your personal Condor (a submit sketch follows below)
› Manage your glide-ins
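For illustration, a rough sketch of what such a “Grid Universe” submission through Globus GRAM can look like; the gatekeeper host, the jobmanager name and the glide-in start-up script are assumptions, not taken from the slides.

universe = grid
grid_resource = gt2 gatekeeper.example.org/jobmanager-pbs
executable = glidein_startup.sh   # script that starts the Condor daemons on the remote resource
log = glidein.log
queue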

(Diagram: your workstation's personal Condor and Group Condor, flocking to a friendly Condor pool and gliding onto Grid resources managed by PBS, LSF and Condor.)

How does Condor glide-in work? (Diagram: the local Condor host, a Sun workstation running the central manager, matchmaker and submit daemon, contacts a Globus GRAM host at Wisconsin, where a glide-in script starts the Condor daemons on an Origin machine; CS = Condor system.) The Origin will belong to your Condor pool.

General rule for Condor pools
› A Condor pool has a Central Manager
› If you are flocking, the Central Manager of the friendly pool becomes visible to your pool, and hence hosts in the other pool can be reached
› If you are gliding, all the new Condor resources will connect directly to your CM

Three levels of scalability in Condor (diagram): among the nodes of a cluster, flocking among clusters, and gliding onto the Grid.

NUG30 - Solved!!!
› Solved in 7 days instead of 10.9 years
› (Plot: number of workers over the first 600K seconds of the run.)

Problems with Condor flocking “grids”
› Friendly relationships are defined statically.
› Firewalls are not allowed between friendly pools.
› The client cannot choose resources (pools) directly.
› Non-standard “Condor protocols” are used to connect friendly pools together.
› The client needs an account to be able to submit a job and get the results back.
› Not service-oriented

Thank you