
1 Douglas Thain (Miron Livny) Computer Sciences Department University of Wisconsin-Madison miron@cs.wisc.edu http://www.cs.wisc.edu/~miron High-Throughput Computing on Commodity Systems.

2 www.cs.wisc.edu/condor The Good News: Raw computing power is everywhere - on desktops, shelves, racks, and in your pockets. It is:  Cheap  Plentiful  Mass-Produced

3 www.cs.wisc.edu/condor The Bad News: GFLOPs per year ≠ GFLOPS per second × 30,000,000 seconds/year
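
To see why, a back-of-the-envelope example (the numbers are illustrative, not from the talk): a workstation with a 1 GFLOPS peak that is actually available to you 25% of the year, and that runs real code at half of peak, delivers about 1 × 30,000,000 × 0.25 × 0.5 ≈ 3.75 million GFLOP in a year - an eighth of the 30 million the peak rate promises.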

4 www.cs.wisc.edu/condor A variation on a chestnut: What is a benchmark?

5 www.cs.wisc.edu/condor Answer: The throughput which your system is guaranteed never to exceed!

6 www.cs.wisc.edu/condor Why? › A community of commodity computers can be difficult to manage:
 Dynamic: state and availability change over time.
 Evolving: new hardware and software are continuously acquired and installed.
 Heterogeneous: in both hardware and software.
 Distributed ownership: each machine has a different owner with different requirements and preferences.

7 www.cs.wisc.edu/condor Why? › Even traditionally “static” systems (such as professionally managed clusters) suffer the same problems when viewed at a yearly scale:  Power failures  Hardware failures  Software upgrades  Load imbalance  Network imbalance

8 www.cs.wisc.edu/condor How do we measure computer performance? › High-Performance Computing:  Achieve maximum GFLOP per second under ideal circumstances. › High-Throughput Computing:  Achieve maximum GFLOP per month or year in whatever conditions prevail.

9 www.cs.wisc.edu/condor High-Throughput Computing › Focuses on maximizing…  simulations run before the paper deadline…  crystal lattices per week…  reconstructions per week…  video frames rendered per year… › …without “babysitting” from the user. › Cannot depend on “ideal” circumstances.

10 www.cs.wisc.edu/condor High-Throughput Computing › Is achieved by:  Expanding the CPUs available.  Silently adapting to inevitable changes.  Robust software › Is only marginally affected by:  MB, MHz, MIPS, FLOPS…  Robust hardware

11 www.cs.wisc.edu/condor Solution: Condor › Condor is software for creating a high-throughput computing environment on a community of workstations, ranging from commodity PCs to supercomputers.

12 www.cs.wisc.edu/condor Who are we?

13 www.cs.wisc.edu/condor The Condor Project (Established ’85) Distributed systems CS research performed by a team that faces  software engineering challenges in a UNIX/Linux/NT environment,  active interaction with users and collaborators,  daily maintenance and support challenges of a distributed production environment,  and educating and training students. Funding - NSF, NASA, DoE, DoD, IBM, INTEL, Microsoft, and the UW Graduate School.

14 www.cs.wisc.edu/condor Users and collaborators › Scientists - Biochemistry, high energy physics, computer sciences, genetics, … › Engineers - Hardware design, software building and testing, animation,... › Educators - Hardware design tools, distributed systems, networking,...

15 www.cs.wisc.edu/condor National Grid Efforts › National Technology Grid - NCSA Alliance (NSF-PACI) › Information Power Grid - IPG (NASA) › Particle Physics Data Grid - PPDG (DoE) › Grid Physics Network - GriPhyN (NSF-ITR)

16 www.cs.wisc.edu/condor Condor CPUs on the UW Campus

17 www.cs.wisc.edu/condor Some Numbers: UW-CS Pool, 6/98-6/00

    Total                     4,000,000 hours   ~450 years
    "Real" Users              1,700,000 hours   ~260 years
      CS-Optimization           610,000 hours
      CS-Architecture           350,000 hours
      Physics                   245,000 hours
      Statistics                 80,000 hours
      Engine Research Center     38,000 hours
      Math                       90,000 hours
      Civil Engineering          27,000 hours
      Business                      970 hours
    "External" Users            165,000 hours   ~19 years
      MIT                        76,000 hours
      Cornell                    38,000 hours
      UCSD                       38,000 hours
      CalTech                    18,000 hours

18 www.cs.wisc.edu/condor Start slow, but think BIG

19 www.cs.wisc.edu/condor Start slow, but think big!
One Personal Condor: 1 machine, on your desktop.
A Condor Pool: 100 machines, in your department.
Condor-G: 1,000 machines, in the Grid.

20 www.cs.wisc.edu/condor Start slow, but think big! › Personal Condor:  Manage just your machine with Condor. Fault tolerance, policy control, logging. Sleep soundly at night. › Condor Pool:  Take advantage of your friends and colleagues: share cycles, gain ~ 100x throughput. › Condor-G:  Jobs from your pool migrate to other computational facilities around the world. Gain 1000x throughput. (Record-breaking results!)

21 www.cs.wisc.edu/condor Key Condor User Services › Local control - jobs are stored and managed locally by a personal scheduler. › Priority scheduling - execution order controlled by a priority ranking assigned by the user. › Job preemption - re-linked jobs can be checkpointed, suspended, held, and resumed. › Local execution environment preserved - re-linked jobs can have their I/O redirected to the submission site.
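
These services are exercised through a submit description file handed to the personal scheduler. A minimal sketch, assuming the standard Condor submit syntax of the period (the file and program names are illustrative):

    # my_sim.sub - sketch of a Condor submit description file
    universe     = standard           # re-linked job: checkpointing and remote I/O
    executable   = my_sim
    output       = my_sim.$(Process).out
    error        = my_sim.$(Process).err
    log          = my_sim.log         # per-job activity logging
    requirements = (Arch == "INTEL") && (OpSys == "SOLARIS251")
    rank         = Memory             # prefer machines with more memory
    queue 100                         # place 100 instances in the queue

Submitting it with condor_submit my_sim.sub queues all 100 jobs under local control; condor_q and condor_rm then inspect and manage them from the submission machine.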

22 www.cs.wisc.edu/condor More Condor User Services › Powerful and flexible means for selecting the execution site (requirements and preferences). › Logging of job activities. › Management of large numbers (10K) of jobs per user. › Support for jobs with dependencies - DAGMan (Directed Acyclic Graph Manager); see the sketch below. › Support for dynamic MW (Master-Worker) applications (PVM and File).
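
DAGMan's input is a plain-text file that names each job's submit file and the dependencies between them. A minimal sketch (job and file names are illustrative):

    # diamond.dag - A runs first, then B and C in parallel, then D
    JOB A a.sub
    JOB B b.sub
    JOB C c.sub
    JOB D d.sub
    PARENT A CHILD B C
    PARENT B C CHILD D

Running condor_submit_dag diamond.dag submits the whole graph; DAGMan itself runs as a Condor job and releases each node only once its parents have finished.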

23 www.cs.wisc.edu/condor How does it work?

24 www.cs.wisc.edu/condor Basic HTC Mechanisms › Matchmaking - enables requests for services and offers of services to find each other (ClassAds). › Fault tolerance - checkpointing enables preemptive-resume scheduling (go ahead and use it as long as it is available!). › Remote execution - enables transparent access to resources from any machine in the world. › Asynchronicity - enables management of dynamic (opportunistic) resources.

25 www.cs.wisc.edu/condor Every Community needs a Matchmaker!

26 www.cs.wisc.edu/condor Why? Because… someone has to bring together community members who have requests for goods and services with members who offer them.  Both sides are looking for each other  Both sides have constraints  Both sides have preferences

27 www.cs.wisc.edu/condor ClassAd - Properties

    Type        = "Machine";
    Activity    = "Idle";
    KbdIdle     = '00:22:31';
    Disk        = 2.1G;        // 2.1 Gigabytes
    Memory      = 64M;         // 64 Megabytes
    State       = "Unclaimed";
    LoadAverage = 0.042969;
    Arch        = "INTEL";
    OpSys       = "SOLARIS251";
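
A machine ad like this is paired against a job ad from the other side of the match. A sketch of what the job's half might look like, in the same ClassAd style (the attribute values here are illustrative, not from the talk):

    Type         = "Job";
    Owner        = "raman";
    Cmd          = "my_sim";
    ImageSize    = 24M;
    // 'other' refers to the candidate ad on the opposite side of the match
    Requirements = other.Arch == "INTEL" && other.OpSys == "SOLARIS251"
                   && other.Memory >= ImageSize;
    Rank         = other.KFlops;   // prefer faster machines

The matchmaker pairs two ads only when both Requirements expressions are satisfied, then uses each side's Rank to choose among the surviving candidates.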

28 www.cs.wisc.edu/condor ClassAd - Policy

    RsrchGrp  = { "raman", "miron", "solomon" };
    Friends   = { "dilbert", "wally" };
    Untrusted = { "rival", "riffraff", "PHB" };
    Tier = member(RsrchGrp, other.Owner) ? 2
         : ( member(Friends, other.Owner) ? 1 : 0 );
    Requirements = !member(Untrusted, other.Owner)
                   && ( Tier == 2 ? True
                      : Tier == 1 ? LoadAvg < 0.3 && KbdIdle > '00:15'
                      : DayTime() > '18:00' );

Read as policy: research-group members may always run here; friends may run only when the machine is lightly loaded and the keyboard has been idle for 15 minutes; everyone else only after hours; untrusted users never.

29 www.cs.wisc.edu/condor Advantages of Matchmaking › Hybrid (centralized + distributed) resource allocation algorithm › End-to-end verification › Bilateral specialization › Weak consistency requirements › Authentication › Fault tolerance › Incremental system evolution

30 www.cs.wisc.edu/condor Fault-Tolerance › Condor can checkpoint a program by writing its image to disk. › If a machine should fail, the program may resume from the last checkpoint. › If a job must vacate a machine, it may resume from where it left off.
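
Checkpointing in this generation of Condor worked on ordinary programs that had been re-linked against the Condor checkpoint library; a sketch of the usual invocation (the program name is illustrative):

    % condor_compile gcc -o my_sim my_sim.c

The re-linked binary can then write its entire memory image to disk when asked to vacate, and restart from that image on another machine.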

31 www.cs.wisc.edu/condor Remote Execution › Condor might run your jobs on machines spread around the world – not all of them will have your files. › Condor provides an adapter – a library – which converts your job’s I/O operations into remote I/O back to your home machine. › No matter where your job runs, it sees the same environment.

32 www.cs.wisc.edu/condor Asynchronicity › A fact of life in a system of 1000s of machines.  Power on/off  Lunch breaks  Jobs start and finish › Condor never depends on a fixed configuration - it works with what is available.

33 www.cs.wisc.edu/condor Does it work?

34 www.cs.wisc.edu/condor An example - NUG28 We are pleased to announce the exact solution of the nug28 quadratic assignment problem (QAP). This problem was derived from the well-known nug30 problem using the distance matrix from a 4 by 7 grid, and the flow matrix from nug30 with the last 2 facilities deleted. This is to our knowledge the largest instance from the nugxx series ever provably solved to optimality. The problem was solved using the branch-and-bound algorithm described in the paper "Solving quadratic assignment problems using convex quadratic programming relaxations," N.W. Brixius and K.M. Anstreicher. The computation was performed on a pool of workstations using the Condor high-throughput computing system in a total wall time of approximately 4 days, 8 hours. During this time the number of active worker machines averaged approximately 200. Machines from UW, UNM, and INFN all participated in the computation.

35 www.cs.wisc.edu/condor NUG30 Personal Condor … For the run we will be flocking to:
-- the main Condor pool at Wisconsin (600 processors)
-- the Condor pool at Georgia Tech (190 Linux boxes)
-- the Condor pool at UNM (40 processors)
-- the Condor pool at Columbia (16 processors)
-- the Condor pool at Northwestern (12 processors)
-- the Condor pool at NCSA (65 processors)
-- the Condor pool at INFN (200 processors)
We will be using glide_in to access the Origin 2000 (through LSF) at NCSA. We will use "hobble_in" to access the Chiba City Linux cluster and Origin 2000 here at Argonne.

36 www.cs.wisc.edu/condor It works!!! Date: Thu, 8 Jun 2000 22:41:00 -0500 (CDT) From: Jeff Linderoth To: Miron Livny Subject: Re: Priority
This has been a great day for metacomputing! Everything is going wonderfully. We've had over 900 machines (currently around 890), and all the pieces are working great…
Date: Fri, 9 Jun 2000 11:41:11 -0500 (CDT) From: Jeff Linderoth
Still rolling along. Over three billion nodes in about 1 day!

37 www.cs.wisc.edu/condor Up to a Point … Date: Fri, 9 Jun 2000 14:35:11 -0500 (CDT) From: Jeff Linderoth
Hi Gang, The glory days of metacomputing are over. Our job just crashed. I watched it happen right before my very eyes. It was what I was afraid of -- they just shut down denali, and losing all of those machines at once caused other connections to time out -- and the snowball effect had bad repercussions for the Schedd.

38 www.cs.wisc.edu/condor Back in Business Date: Fri, 9 Jun 2000 18:55:59 -0500 (CDT) From: Jeff Linderoth Hi Gang, We are back up and running. And, yes, it took me all afternoon to get it going again. There was a (brand new) bug in the QAP "read checkpoint" information that was making the master coredump. (Only with optimization level -O4). I was nearly reduced to tears, but with some supportive words from Jean-Pierre, I made it through.

39 www.cs.wisc.edu/condor The First 600K seconds …

40 www.cs.wisc.edu/condor We made it!!! Sender: goux@dantec.ece.nwu.edu Subject: Re: Let the festivities begin. Hi dear Condor Team, you all have been amazing. NUG30 required 10.9 years of Condor Time. In just seven days ! More stats tomorrow !!! We are off celebrating ! condor rules ! cheers, JP.

41 www.cs.wisc.edu/condor Do not be picky, be agile!!!

