MPI Scheduling in Condor: An Update
Derek Wright, Computer Sciences Department, University of Wisconsin-Madison
Paradyn/Condor Week, Madison, WI, 2002

Outline › Review of Dedicated/MPI Scheduling in Condor  Dedicated vs. Opportunistic  Backfill › Supported MPI Implementations › Supported Platforms › Future Work

What is MPI? › MPI is the “Message Passing Interface” › A library for writing parallel applications  Fixed number of nodes  Cannot be preempted › Lots of scientists use it for large problems › MPI is a standard with many different implementations
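For the curious, a minimal MPI program in C is sketched below. The process count is fixed when the job is launched (e.g., mpirun -np 4 ./hello) and cannot change afterward, which is why an MPI job cannot shrink gracefully when one of its nodes is preempted.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* start as one of a fixed set of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id: 0 .. size-1 */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes, fixed at launch */

    printf("node %d of %d reporting\n", rank, size);

    MPI_Finalize();
    return 0;
}
```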

Dedicated Scheduling in Condor › To schedule MPI jobs, Condor must have access to dedicated resources › More and more Condor pools are being formed from dedicated resources › Few schedulers handle both dedicated and non-dedicated resources at the same time

Problems with Dedicated Compute Clusters › Dedicated resources are not really dedicated  Most software for controlling clusters relies on dedicated scheduling algorithms  Assume constant availability of resources to compute fixed schedules › Due to hardware and software failure, dedicated resources are not always available over the long-term

Look Familiar? [image slide]

Two common views of a Cluster: [image slides]

The Condor Solution › Condor overcomes these difficulties by combining aspects of dedicated and opportunistic scheduling into a single system  Opportunistic scheduling involves placing jobs on non-dedicated resources under the assumption that the resources might not be available for the entire duration of the jobs  This is what Condor has been doing for years

The Condor Solution (cont’d) › Condor manages all resources and jobs within a single system  Administrators only have to maintain one system, saving time and money  Users can submit a wide variety of jobs: Serial or parallel (including PVM + MPI) Spend less time learning different scheduling tools, more time doing science

Claiming Resources for Dedicated Jobs › When the dedicated scheduler (DS) has idle jobs, it queries the collector to find all dedicated resources › DS does match-making to decide which resources it wants › DS sends requests to the opportunistic scheduler to claim those resources › DS claims resources and has exclusive control (until it releases them)
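A rough sketch of that claiming cycle appears below; the type and function names are invented for illustration and are not Condor's internal API.

```c
/* Illustrative sketch of the dedicated scheduler's (DS) claiming cycle.
 * These names are invented; this is not Condor's actual API. */
#include <stdio.h>

typedef struct {
    const char *name;
    int         is_dedicated;  /* advertised to the collector as dedicated */
    int         claimed;       /* under the DS's exclusive control */
} resource;

/* Step 2: trivial "match-making" -- take any dedicated, unclaimed node. */
static int ds_wants(const resource *r)
{
    return r->is_dedicated && !r->claimed;
}

int main(void)
{
    /* Step 1: pretend this is the result of querying the collector. */
    resource pool[] = {
        { "node01",     1, 0 },
        { "node02",     1, 0 },
        { "desktop-pc", 0, 0 },  /* opportunistic only: the DS never claims it */
    };
    int n = (int)(sizeof pool / sizeof pool[0]);

    for (int i = 0; i < n; i++) {
        if (ds_wants(&pool[i])) {
            /* Steps 3-4: request the claim from the opportunistic
             * scheduler, then hold exclusive control until released. */
            pool[i].claimed = 1;
            printf("DS claimed %s\n", pool[i].name);
        }
    }
    return 0;
}
```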

Backfilling: The Problem › All dedicated schedulers leave “holes” › Traditional solution is to use backfilling  Use lower priority parallel jobs  Use serial jobs › However, if you can’t checkpoint the serial jobs, and/or you don’t have any parallel jobs of the right size and duration, you’ve still got holes
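The rigid fit test behind traditional backfilling is easy to state; a sketch with illustrative numbers follows. A candidate can fill a hole only if it needs no more nodes than the hole offers and finishes before the hole closes, which is exactly the constraint that checkpointable opportunistic jobs escape.

```c
/* Sketch of the classic backfill fit test (illustrative only). */
#include <stdio.h>

typedef struct { int nodes; int seconds; } hole;  /* gap in the dedicated schedule */
typedef struct { int nodes; int seconds; } job;   /* backfill candidate */

static int fits(const job *j, const hole *h)
{
    return j->nodes <= h->nodes && j->seconds <= h->seconds;
}

int main(void)
{
    hole h = { 8, 3600 };  /* 8 idle nodes for one hour */
    job candidates[] = {
        { 16,  600 },  /* too wide: needs 16 nodes */
        {  4, 7200 },  /* too long: runs past the hole */
        {  4, 1800 },  /* fits */
    };

    for (int i = 0; i < 3; i++)
        printf("job %d: %s\n", i,
               fits(&candidates[i], &h) ? "fits" : "leaves a hole");
    return 0;
}
```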

Backfilling: The Condor Solution › In Condor, we already have an infrastructure for managing non-dedicated nodes with opportunistic scheduling, so we use that to fill the holes in the dedicated schedule  Our opportunistic jobs can be checkpointed and migrated when the dedicated scheduler needs the resources again  Allows dedicated resources to be used for opportunistic jobs as needed

Specific MPI Implementations › Supported:  MPICH › Planned:  MPIPro  LAM › Others?

Condor’s MPICH Support › MPICH uses rsh to spawn jobs › Condor provides its own rsh tool  Older versions of MPICH need to be built without a hard-coded path to rsh  Newer versions of MPICH support an environment variable, P4_RSHCOMMAND, which specifies what program should be used
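The mechanism is easy to picture; below is a sketch (not MPICH's actual source) of how a P4-style launcher can honor the variable, falling back to plain rsh when it is unset. Condor points P4_RSHCOMMAND at its own rsh-compatible tool so the spawn goes through Condor.

```c
/* Illustrative sketch of P4_RSHCOMMAND resolution (not MPICH source). */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *rsh = getenv("P4_RSHCOMMAND");
    if (rsh == NULL)
        rsh = "rsh";  /* historical default in MPICH's p4 device */

    printf("spawning remote MPI processes with: %s\n", rsh);
    /* A real launcher would exec this program once per remote node. */
    return 0;
}
```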

Condor and MPIPro › We’ve investigated supporting MPIPro jobs with Condor › MPIPro has some issues with selecting a port for the head node in your computation, and we’re looking for a good solution

Condor + LAM = “LAMdor” › LAM’s API is better suited for a dynamic environment, where hosts can come and go from your MPI universe › LAM has a different mechanism for spawning jobs than MPICH › Condor is working to support their methods for spawning

LAMdor (Cont’d) › The LAM team is working to understand, expand, and fully implement the dynamic scheduling calls in their API › They are also considering using Condor’s libraries to support checkpointing of MPI computations
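The slides don't name the calls, but MPI-2's dynamic process management, which LAM implemented, is the kind of interface at issue; the sketch below assumes that reading. The "worker" executable is hypothetical.

```c
/* A parent asks the runtime for more processes at run time -- growth a
 * static MPICH-1 job cannot do.  The "worker" binary is hypothetical. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm children;

    MPI_Init(&argc, &argv);
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);
    /* ... communicate with the children over the intercommunicator ... */
    MPI_Finalize();
    return 0;
}
```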

Other MPI implementations › What are people using? › Do you want to see Condor support any other MPI implementations? › If so, let us know by sending mail to:

Supported Platforms › Condor’s MPI support is now available on all Condor platforms:  Unix: Linux, Solaris, Digital Unix, IRIX, HP-UX  Windows (new since last year): NT, 2000

Future work (short-term) › Implementing more advanced dedicated scheduling algorithms  Integrating Condor’s user priority system with its dedicated scheduling  Adding support for user-specified job priorities (among their own jobs) › Condor-MPI support for the Tool Daemon Protocol

Future work (longer term) › Solving problems w/ MPI on the Grid  "Flocking" MPI jobs to remote pools, or even spanning pools with a single computation  Solving issues of resource ownership on the Grid (i.e. how do you handle multiple dedicated schedulers on the grid wanting to control a given resource?)

More Future work › Support for other kinds of dedicated jobs:  Generic dedicated jobs: we gather and schedule the resources, then call your program, give it the list of machines, and let the program spawn itself (sketched below)  Linda (parallel programming interface), used by Gaussian (computational chemistry)
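A hypothetical skeleton of such a generic dedicated job follows: Condor hands the program a file listing the claimed machines, and the program fans itself out. The file format and argument convention here are invented for illustration.

```c
/* Hypothetical generic dedicated job: read the machine list Condor
 * provides and spawn a worker on each host.  Conventions invented. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s machine-list-file\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "r");
    if (f == NULL) {
        perror(argv[1]);
        return 1;
    }

    char host[256];
    while (fgets(host, sizeof host, f) != NULL) {
        host[strcspn(host, "\n")] = '\0';   /* strip trailing newline */
        /* A real job might rsh/ssh a worker onto each host here. */
        printf("would spawn a worker on %s\n", host);
    }
    fclose(f);
    return 0;
}
```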

More Future work › Better support for preempting opportunistic jobs to facilitate running high-priority dedicated ones  “Checkpointing” vanilla jobs to swap space › Checkpointing entire MPI computations › MW using Condor-MPI

How do I start using MPI with Condor? › MPI support was added and tested in the current development series (6.3.X) › MPI support will be a built-in feature of the next stable series of Condor (6.4.X), which will be released Any Day Now™

Thanks for Listening! › Questions?  Come to the MPI “BoF”, Wednesday, 3/6/02, 11am-noon, 3385 CS › For more information:  