Building a Campus Grid with Existing Resources LabMan Conference, Notre Dame June 8-9, 2009 Preston Smith Purdue University.


1 Building a Campus Grid with Existing Resources LabMan Conference, Notre Dame June 8-9, 2009 Preston Smith Purdue University

2 Special Thanks Thanks to the Condor Team at Wisconsin for graciously allowing us to borrow from their tutorial materials!

3 Outline Supercomputers on Campus –Campus Grids –High-Throughput Computing –The impact of the campus grid The Condor Software –Condor 101, at 200 mph Condor from an administrator’s view –Policies –Networking –Security –Virtual Appliance

4 Campus Grids Campus grids link computing resources within universities and research institutes, often including geographically distributed sites. –Dedicated computing resources –Idle non-dedicated computing resources Workstations Student Labs –Campus grids build computation resources out of an institution’s existing investment in computer resources

5 Supercomputers on Campus Purdue’s Campus Grid currently has 23,000 cores –There are only 21 systems on the 11/2008 Top 500 list with 16,000 or more cores. Theoretical peak capacity of the campus grid is 177 Teraflops –This would place at #12 on the 11/2008 Top 500 list Acquiring a resource of this scale is expensive! –$3 million for compute nodes alone –Requires 2000 square feet of floor space, plus power and cooling

6 BoilerGrid Purdue’s Campus Grid – West Lafayette Campus 23,000 cores – X86_64, ia32, ia64 Linux Idle HPC nodes in Rosen Center clusters –Solaris, MacOS X –Windows Instructional lab systems at main West Lafayette campus and Purdue’s regional campuses

7 BoilerGrid Backfilling on idle HPC cluster nodes –Condor runs on idle cluster nodes (nearly 10,000 cores today) when a node isn’t busy with PBS (primary scheduler) jobs

8 BoilerGrid Windows systems (~7000 cores) –Instructional Labs Purdue’s TLT division has run Condor on labs since 2001 – Supporting student rendering, some faculty research –Library terminals Dedicated Condor resources –GPU rendering cluster –FPGA computation accelerator

9 BoilerGrid around Campus To date, the bulk of BoilerGrid cycles are provided by ITaP, Purdue’s central IT –Rosen Center for Advanced Computing (RCAC) – Research Computing Community Clusters – See –Teaching and Learning Technologies (TLT) – Student Labs Centrally operated Linux clusters provide approximately 12k cores Centrally operated student labs provide 7k Windows cores That’s actually a lot of cores now, but there’s more around a large campus like Purdue –27,317 machines, to be exact –Can the campus grid cover most of campus?

10 Target: All of Campus Green Computing is big everywhere, Purdue is no exception CIO’s challenge – power-save your idle computers, or run Condor and join BoilerGrid –University’s President runs Condor on her PC Centrally supported workstations have Condor available for install through SCCM. Thou shalt turn off thy computer or run Condor

11 Other Campus Grids Grid Laboratory of Wisconsin (GLOW) –University of Wisconsin, Madison FermiGrid –Fermi National Accelerator Lab Clemson University Rochester Institute of Technology

12 DiaGrid New name for our effort to spread the campus grid gospel beyond Purdue’s borders –Perhaps institutions who wear red or green and may be rivals on the gridiron or hardwood wouldn’t like being in something named “Boiler”. We’re regularly asked about implementing a Purdue-style campus grid at institutions without HPC on their campus. –Federate our campus grids into something far greater than what one institution can do alone

13 DiaGrid Partners Sure, it’d make a good basketball tournament… Purdue - West Lafayette Purdue Regionals –Calumet –North Central –IPFW –Statewide Technology –Cooperative Extension Offices Indiana University Notre Dame Indiana State Wisconsin (GLOW) –Via JobRouter Louisville Your Campus??

14 National scale: TeraGrid The Purdue Condor Pool is a resource available for allocation to anybody in the nation today NSF now recognizes high-throughput computing resources as a critical part of the nation’s cyberinfrastructure portfolio going forward. –Not just megaclusters, XT5s, Blue Waters, etc., but loosely-coupled as well NSF vision for HPC - Sharing among academic institutions to optimize the accessibility and use of HPC as supported at the campus level –This matches closely with our goal to spread the gospel of the campus grid via DiaGrid

15 High Throughput Computing As on the Top 500 List, High Performance Computing is usually measured in floating point operations per second (FLOPS). High Throughput Computing is concerned with how many floating point operations per month or per year users can extract from their computing environment, rather than the number of such operations the environment can provide them per second or minute.

16 Impact - Disciplines Supply Chain Simulations Structural Biology (viruses) Astrophysics Particle Physics Mathematics Economics Communication Materials Science Hydrology Bioinformatics

17 Impact

18 Condor

19 The Condor Software Available as a free download from Download Condor for your operating system –Available for most UNIX (including Linux and Apple’s OS X) platforms –Windows NT / XP / Vista

20 Full featured system Flexible scheduling policy engine via ClassAds –Preemption, suspension, requirements, preferences, groups, quotas, settable fair-share, system hold… Facilities to manage BOTH dedicated CPUs (clusters) and non-dedicated resources (desktops) Transparent Checkpoint/Migration for many types of serial jobs No shared file-system required Federate clusters w/ a wide array of Grid Middleware

21 Full featured system Workflow management (inter-dependencies) Support for many job types – serial, parallel, etc. Fault-tolerant: can survive crashes, network outages, no single point of failure. Development APIs: via SOAP / web services, DRMAA (C), Perl package, GAHP, flexible command-line tools, MW Platforms: Linux i386/IA64, Windows 2k/XP/Vista, MacOS, FreeBSD, Solaris, IRIX, HP-UX, Compaq Tru64, … lots. –IRIX and Tru64 are no longer supported by current releases of Condor

22 Condor – at 200 mph We could talk about Condor all day... –So just the highlights

23 Meet Phil. He is a scientist with a big problem.

24 Phil’s Application … Run a Parameter Sweep of F(x,y,z) for 200 values of x, 100 values of y and 30 values of z –200×100×30 = 600,000 combinations –F takes on the average 6 hours to compute on a “typical” workstation ( total = 600,000 × 6 = 3,600,000 hours: 410 years ) –F requires a “moderate” (512 MB) amount of memory –F performs “moderate” I/O - (x,y,z) is 5 MB and F(x,y,z) is 50 MB
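
The arithmetic on this slide can be checked with a short Python sketch. The variable names are illustrative, and the conversion to years assumes 365-day years; the aggregate I/O totals are not on the slide and simply multiply the per-run sizes by the number of runs.

```python
# Size of Phil's parameter sweep, using the figures from the slide.
x_values, y_values, z_values = 200, 100, 30
hours_per_run = 6             # average time for one F(x, y, z) on a typical workstation
input_mb, output_mb = 5, 50   # per-run input and output sizes

combinations = x_values * y_values * z_values
total_hours = combinations * hours_per_run
total_years = total_hours / (24 * 365)   # assumption: 365-day years

# Aggregate I/O if every run ships its own input and output (decimal units).
total_input_tb = combinations * input_mb / 1_000_000
total_output_tb = combinations * output_mb / 1_000_000

print(f"{combinations:,} runs, {total_hours:,} CPU-hours, "
      f"roughly {total_years:.1f} years on one workstation")
print(f"{total_input_tb:.0f} TB of input, {total_output_tb:.0f} TB of output")
```

The serial run time comes out just under 411 years, which the slide rounds down to 410.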

25 I have 600,000 simulations to run. Where can I get help?

26 NSF won’t fund the Blue Gene that I requested.

27 While sharing a beverage with some colleagues, Phil shares his problem. Somebody asks “Have you tried Condor?”

28 Phil Installs a “Personal Condor” on his machine… What do we mean by a “Personal” Condor? –Condor on your own workstation –No root / administrator access required –No system administrator intervention needed After installation, Phil submits his jobs to his Personal Condor…

29 Phil’s Condor Pool [Diagram: Phil's workstation runs a personal Condor that holds the 600k Condor jobs, such as F(3,4,5).]

30 Personal Condor?! What’s the benefit of a Condor “Pool” with just one user and one machine?

31 Condor will... Keep an eye on your jobs and will keep you posted on their progress Implement your policy on the execution order of the jobs Keep a log of your job activities Add fault tolerance to your jobs Implement your policy on when the jobs can run on your workstation

32 Definitions Job –The Condor representation of your work Machine –The Condor representation of computers that can perform the work Match Making –Matching a job with a machine “Resource”

33 Job Jobs state their requirements and preferences: I need a Linux/x86 platform I want the machine with the most memory I prefer a machine in the chemistry department

34 Machine Machines state their requirements and preferences: Run jobs only when there is no keyboard activity I prefer to run Phil’s jobs I am a machine in the physics department Never run jobs belonging to Dr. Smith

35 The Magic of Matchmaking Jobs and machines state their requirements and preferences Condor matches jobs with machines based on requirements and preferences
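
The two-sided nature of matchmaking can be sketched in Python. This is a toy model, not Condor's negotiator: the dictionaries, the `match` function, and the machine names are hypothetical, and the real matchmaker also weighs machine preferences and user priorities, which are left out here.

```python
def match(job, machines):
    """Return the machine the job prefers among mutually acceptable ones, or None."""
    candidates = [
        m for m in machines
        # Both sides must accept: the job's requirements are evaluated against
        # the machine's attributes, and the machine's against the job's.
        if job["requirements"](m["attrs"]) and m["requirements"](job["attrs"])
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda m: job["rank"](m["attrs"]))

job = {
    "attrs": {"Owner": "phil"},
    # "I need a Linux/x86 platform"
    "requirements": lambda m: m["OpSys"] == "LINUX" and m["Arch"] == "INTEL",
    # "I want the machine with the most memory"
    "rank": lambda m: m["Memory"],
}

machines = [
    {"attrs": {"Name": "chem01", "OpSys": "LINUX", "Arch": "INTEL", "Memory": 2048},
     "requirements": lambda j: True},
    {"attrs": {"Name": "phys07", "OpSys": "LINUX", "Arch": "INTEL", "Memory": 4096},
     # "Never run jobs belonging to Dr. Smith"
     "requirements": lambda j: j["Owner"] != "smith"},
    {"attrs": {"Name": "lab12", "OpSys": "WINDOWS", "Arch": "INTEL", "Memory": 8192},
     "requirements": lambda j: True},
]

best = match(job, machines)
print(best["attrs"]["Name"])
```

Phil's job lands on the Linux machine with the most memory, while the same job submitted by the hypothetical user `smith` would fall back to the other Linux machine because one machine's own requirements reject it.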

36 Using the Vanilla Universe The Vanilla Universe: – Allows running almost any “serial” job – Provides automatic file transfer, etc. – Like vanilla ice cream: can be used in just about any situation

37 Make your job batch-ready Must be able to run in the background No interactive input No windows No GUI

38 Create a Submit Description File A plain ASCII text file Condor does not care about file extensions Tells Condor about your job: –Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later) Can describe many jobs at once (a “cluster”), each with different input, arguments, output, etc.

39 Simple Submit Description File
    # Simple condor_submit input file
    # (Lines beginning with # are comments)
    # NOTE: the words on the left side are not
    # case sensitive, but filenames are!
    Universe = vanilla
    Executable = my_job
    Output = output.txt
    Queue
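
Because a submit description is plain text, it is easy to generate from a script when many variants are needed. The sketch below uses only the commands shown on this slide; `render_submit` is a hypothetical helper, not part of Condor.

```python
def render_submit(commands, queue_count=1):
    """Render a submit description: one 'Name = value' line per command, then Queue."""
    lines = [f"{name} = {value}" for name, value in commands.items()]
    lines.append("Queue" if queue_count == 1 else f"Queue {queue_count}")
    return "\n".join(lines) + "\n"

submit_text = render_submit({
    "Universe": "vanilla",
    "Executable": "my_job",
    "Output": "output.txt",
})
print(submit_text)
```

The resulting text could be written to a file such as `my_job.submit` and handed to `condor_submit`.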

40 4. Run condor_submit You give condor_submit the name of the submit file you have created: –condor_submit my_job.submit condor_submit: –Parses the submit file, checks for errors –Creates a “ClassAd” that describes your job(s) –Puts job(s) in the Job Queue

41 ClassAd ? Condor’s internal data representation –Similar to classified ads (as the name implies) –Represent an object & its attributes Usually many attributes –Can also describe what an object matches with

42 ClassAd Details ClassAds can contain a lot of details –The job’s executable is analysis.exe –The machine’s load average is 5.6 ClassAds can specify requirements –I require a machine with Linux ClassAds can specify preferences –This machine prefers to run jobs from the physics group

43 ClassAd Details (continued) ClassAds are: –semi-structured –user-extensible –schema-free –Attribute = Expression

44 ClassAd Example
    MyType = "Job"
    TargetType = "Machine"
    ClusterId = 1377
    Owner = "roy"
    Cmd = "sim.exe"
    Requirements = (Arch == "INTEL") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)
    …
Value types shown: String, Number, Boolean

45 The Dog ClassAd
    Type = "Dog"
    Color = "Brown"
    Price = 12
ClassAd for the “Job”
    ...
    Requirements = (type == "Dog") && (color == "Brown") && (price <= 15)
    ...

46 Phil’s Condor Pool [Diagram: Phil's workstation runs a personal Condor that holds the 600k Condor jobs, such as F(3,4,5).] Phil can still only run one job at a time, however.

47 Good News (Boss) The Boss says Phil can add his co-workers’ desktop machines into his Condor pool as well… but only if they can also submit jobs.

48 Adding nodes Phil installs Condor on the desktop machines, and configures them with his machine as the central manager –The central manager: Central repository for the whole pool Performs job / machine matching, etc. These are “non-dedicated” nodes, meaning that they can't always run Condor jobs

49 Phil’s Condor Pool [Diagram: the 600k Condor jobs feed a Condor pool.] Now, Phil and his co-workers can run multiple jobs at a time so their work completes sooner.

50 How can my jobs access their data files?

51 Condor File Transfer
ShouldTransferFiles = YES –Always transfer files to execution site
ShouldTransferFiles = NO –Rely on a shared filesystem
ShouldTransferFiles = IF_NEEDED –Will automatically transfer the files if the submit and execute machine are not in the same FileSystemDomain
    Universe = vanilla
    Executable = my_job
    Log = my_job.log
    ShouldTransferFiles = IF_NEEDED
    Transfer_input_files = dataset.$(Process), common.data
    Transfer_output_files = TheAnswer.dat
    Queue 600
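
With `Queue 600`, the `$(Process)` macro takes the values 0 through 599, so each job receives its own `dataset.N` file alongside the shared `common.data`. The Python sketch below imitates that expansion; `expand_process_macro` is a hypothetical helper written for illustration.

```python
def expand_process_macro(template, queue_count):
    """Yield the template with $(Process) replaced by 0, 1, ..., queue_count - 1."""
    for process in range(queue_count):
        yield template.replace("$(Process)", str(process))

input_files = list(expand_process_macro("dataset.$(Process)", 600))
print(input_files[0], input_files[-1], len(input_files))
```

A similar loop is a convenient way to check, before submitting, that every per-job input file actually exists.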

52 Phil’s Condor Pool [Diagram: the 600k Condor jobs feed a Condor pool that now includes a dedicated cluster.] With the additional resources, Phil and his co-workers can get their jobs completed even faster.

53 Now what? Some of the machines in the pool can’t run my jobs –Not enough RAM –Not enough scratch disk space –Required software not installed –Etc.

54 Specify Requirements An expression (syntax similar to C or Java) Must evaluate to True for a match to be made
    Universe = vanilla
    Executable = my_job
    Log = my_job.log
    InitialDir = run_$(Process)
    Requirements = Memory >= 256 && Disk >
    Queue 600

55 Advanced Requirements Requirements can match custom attributes in your Machine Ad –Can be added by hand to each machine
    Universe = vanilla
    Executable = my_job
    Log = my_job.log
    InitialDir = run_$(Process)
    Requirements = Memory >= 256 && Disk > \
                   && ( HasMATLAB =?= TRUE) )
    Queue 600

56 And, Specify Rank All matches which meet the requirements can be sorted by preference with a Rank expression. The higher the Rank, the better the match
    Universe = vanilla
    Executable = my_job
    Log = my_job.log
    Arguments = -arg1 -arg2
    InitialDir = run_$(Process)
    Requirements = Memory >= 256 && Disk >
    Rank = (KFLOPS*10000) + Memory
    Queue 600
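
How Requirements and Rank interact can be sketched in Python: requirements filter, rank orders. The machine list is invented for illustration, and only the memory clause of the slide's Requirements is modeled because the disk threshold is not given.

```python
machines = [
    {"Name": "slow-big", "Memory": 4096, "KFLOPS": 800_000},
    {"Name": "fast-small", "Memory": 512, "KFLOPS": 1_500_000},
    {"Name": "too-small", "Memory": 128, "KFLOPS": 2_000_000},
]

def requirements(machine):
    # Memory >= 256, the memory clause from the slide.
    return machine["Memory"] >= 256

def rank(machine):
    # Rank = (KFLOPS*10000) + Memory, as on the slide.
    return machine["KFLOPS"] * 10000 + machine["Memory"]

eligible = [m for m in machines if requirements(m)]
by_preference = sorted(eligible, key=rank, reverse=True)
print([m["Name"] for m in by_preference])
```

The large multiplier on KFLOPS makes floating point speed dominate, so memory only breaks ties between machines of nearly equal speed. The fastest machine in this list never appears because it fails the requirements outright.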

57 What does the IT shop need to know? The IT administrator should know: –Condor’s daemons –Policy Configuration –Security –Virtualization

58 Typical Condor Pool [Diagram: a Central Manager running master, collector, negotiator, schedd, and startd; a Submit-Only node running master and schedd; two Execute-Only nodes running master and startd; two Regular Nodes running master, schedd, and startd. Legend: ClassAd communication pathway; process spawned.]

59 Job Startup [Diagram: Submit Machine with Submit, Schedd, and Shadow; Execute Machine with Startd, Starter, and the Job linked to the Condor Syscall Lib; Central Manager with Collector and Negotiator.]

60 Ok, now what? Default configuration is pretty sane –Only start a job when the keyboard is idle for > 15 minutes and there is no CPU load –Terminate a job when the keyboard or mouse is used, or when the CPU is busy for more than two minutes Can one customize how Condor behaves?

61 Policy Expressions Allow machine owners to specify job priorities, restrict access, and implement local policies

62 (The Boss) I asked the computer lab folks to add nodes into Condor… but the jobs from their users have priority there Policy Configuration

63 New Settings for the lab machines Prefer lab jobs
    START = True
    RANK = Department == "Lab"
    SUSPEND = False
    CONTINUE = True
    PREEMPT = False
    KILL = False

64 Submit file with Custom Attribute Prefix an entry with “+” to add to job ClassAd
    Executable = 3dsmax.exe
    Universe = vanilla
    +Department = "Lab"
    queue

65 More Complex RANK Give the machine’s owners (psmith and jpcampbe) highest priority, followed by Lab, followed by the Physics department, followed by everyone else.

66 More Complex RANK
    IsOwner = (Owner == "psmith" || Owner == "jpcampbe")
    IsTLT = (Department =!= UNDEFINED && Department == "Lab")
    IsPhys = (Department =!= UNDEFINED && Department == "Physics")
    RANK = $(IsOwner)*20 + $(IsTLT)*10 + $(IsPhys)
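
The weighted RANK expression can be traced in Python to see which jobs a machine would prefer. This is a sketch: `machine_rank` and the sample job ads are hypothetical, and `None` stands in for a missing (UNDEFINED) attribute.

```python
UNDEFINED = None  # stand-in for a missing ClassAd attribute

def machine_rank(job):
    """Mirror RANK = IsOwner*20 + IsTLT*10 + IsPhys for one job ad."""
    owner = job.get("Owner", UNDEFINED)
    department = job.get("Department", UNDEFINED)
    is_owner = owner in ("psmith", "jpcampbe")
    is_tlt = department is not UNDEFINED and department == "Lab"
    is_phys = department is not UNDEFINED and department == "Physics"
    # Booleans count as 0 or 1 in the arithmetic below.
    return is_owner * 20 + is_tlt * 10 + is_phys

jobs = [
    {"Owner": "visitor"},
    {"Owner": "student", "Department": "Physics"},
    {"Owner": "student", "Department": "Lab"},
    {"Owner": "psmith"},
]
for job in sorted(jobs, key=machine_rank, reverse=True):
    print(machine_rank(job), job)
```

The weights are spaced so that a higher tier always beats any lower one: an owner's job scores at least 20, a Lab job at most 11.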

67 So far this is okay, but... Condor can use staff desktops when they would otherwise be idle Policy Configuration (The Boss)

68 Defining Idle One possible definition: –No keyboard or mouse activity for 5 minutes –Load average below 0.3

69 Desktops should START jobs when the machine becomes idle SUSPEND jobs as soon as activity is detected PREEMPT jobs if the activity continues for 5 minutes or more KILL jobs if they take more than 5 minutes to preempt
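
The desktop policy amounts to a small state machine. The Python sketch below models it under stated assumptions: the state names are invented for illustration and are not Condor's own, the idle thresholds are taken from the "Defining Idle" slide, and resuming a suspended job once activity stops is an assumption that the slide does not spell out.

```python
IDLE_SECONDS = 5 * 60     # no keyboard or mouse activity for 5 minutes
MAX_LOAD = 0.3            # load average below 0.3
ACTIVITY_LIMIT = 5 * 60   # preempt if activity continues for 5 minutes or more
PREEMPT_LIMIT = 5 * 60    # kill if preemption takes more than 5 minutes

def is_idle(keyboard_idle_s, load_avg):
    """One possible definition of an idle desktop."""
    return keyboard_idle_s >= IDLE_SECONDS and load_avg < MAX_LOAD

def next_state(state, user_active, machine_idle, seconds_in_state):
    """Return the next state of a desktop slot under the policy above."""
    if state == "unclaimed":
        return "running" if machine_idle else "unclaimed"       # START
    if state == "running":
        return "suspended" if user_active else "running"        # SUSPEND
    if state == "suspended":
        if not user_active:
            return "running"                                    # assumed: resume the job
        if seconds_in_state >= ACTIVITY_LIMIT:
            return "preempting"                                 # PREEMPT
        return "suspended"
    if state == "preempting":
        return "killed" if seconds_in_state > PREEMPT_LIMIT else "preempting"  # KILL
    return state

print(next_state("unclaimed", False, is_idle(600, 0.1), 0))
print(next_state("suspended", True, False, 300))
```

Keeping the thresholds as named constants mirrors how such policies are usually tuned: the structure stays fixed while the numbers change per department.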

70 Policies Policies are nearly infinitely customizable! –If you can describe it, you can make Condor do it! A couple examples follow

71 Custom Machine Attributes Can add attributes to a machine’s ClassAd, typically done in the local config file
    HAS_MATLAB = TRUE
    NETWORK_SPEED = 1000
    MATLAB_PATH = "c:\matlab\bin\matlab.exe"
    STARTD_EXPRS = HAS_MATLAB, MATLAB_PATH, NETWORK_SPEED

72 Custom Machine Attributes Jobs can now specify Rank and Requirements using new attributes:
    Requirements = (HAS_MATLAB =?= UNDEFINED || HAS_MATLAB==TRUE)
    Rank = NETWORK_SPEED =!= UNDEFINED && NETWORK_SPEED
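
The `=?=` and `=!=` operators matter because a custom attribute may be missing from some machine ads. The Python sketch below models the difference from plain `==`; the helper names are hypothetical, and the model covers only the cases needed here.

```python
class Undefined:
    """Marker for a ClassAd attribute that is missing (evaluates to UNDEFINED)."""
    def __repr__(self):
        return "UNDEFINED"

UNDEFINED = Undefined()

def eq(a, b):
    """Model of '==': the result is UNDEFINED if either operand is UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a == b

def is_identical(a, b):
    """Model of '=?=': always True or False, never UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    return type(a) is type(b) and a == b

def is_not_identical(a, b):
    """Model of '=!=': the negation of '=?='."""
    return not is_identical(a, b)

def slide_requirement(has_matlab):
    """(HAS_MATLAB =?= UNDEFINED || HAS_MATLAB == TRUE), as written on the slide."""
    return is_identical(has_matlab, UNDEFINED) or eq(has_matlab, True) is True

for value in (UNDEFINED, True, False):
    print(value, eq(value, True), is_identical(value, True), slide_requirement(value))
```

As written, the slide's Requirements expression accepts machines that advertise `HAS_MATLAB = TRUE` and also machines that do not advertise the attribute at all; only an explicit `FALSE` is rejected.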

73 START policies Time of Day Policy
    WorkHours = ( (ClockMin >= 480 && ClockMin < 1020) && \
                  (ClockDay > 0 && ClockDay < 6) )
    AfterHours = ( (ClockMin < 480 || ClockMin >= 1020) || \
                   (ClockDay == 0 || ClockDay == 6) )
    # Only start jobs after hours.
    START = $(AfterHours) && $(CPUIdle) && KeyboardIdle > $(StartIdleTime)
    # Consider the machine busy during work hours,
    # or if the keyboard or CPU are busy.
    MachineBusy = ( $(WorkHours) || $(CPUBusy) || $(KeyboardBusy) )
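
The time-of-day test can be tried out in Python. This sketch assumes work hours run from minute 480 (08:00) up to minute 1020 (17:00), Monday through Friday, with ClockDay counting from 0 for Sunday to 6 for Saturday; the function names are illustrative.

```python
from datetime import datetime

def clock_min(t):
    """Minutes since midnight, like ClockMin."""
    return t.hour * 60 + t.minute

def clock_day(t):
    """Day of week with 0 = Sunday ... 6 = Saturday, like ClockDay."""
    return (t.weekday() + 1) % 7   # Python's weekday() has Monday = 0

def work_hours(t):
    return 480 <= clock_min(t) < 1020 and 0 < clock_day(t) < 6

def after_hours(t):
    return not work_hours(t)

monday_morning = datetime(2009, 6, 8, 10, 0)   # first day of the conference, a Monday
sunday_morning = datetime(2009, 6, 7, 10, 0)
print(work_hours(monday_morning), after_hours(sunday_morning))
```

Testing the boundaries (07:59, 17:00, weekends) in a script like this is a cheap way to catch off-by-one mistakes before a policy goes onto hundreds of desktops.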

74 START policies Policy to keep your network from saturating from off-campus jobs
    SmallRemoteJob = ( DiskUsage <= && \
                       FileSystemDomain != "my.filesystem.domain")
    # Only start jobs that don’t bring along
    # huge amounts of data from off-campus.
    START = $(SmallRemoteJob) && $(START)

75 Security

76 Host/IP Address Security The basic security model in Condor –Stronger security available (Encrypted communications, cryptographic authentication) Can configure each machine in your pool to allow or deny certain actions from different groups of machines

77 Advanced Security Features AUTHENTICATION – Who is allowed ENCRYPTION - Private communications, requires AUTHENTICATION. INTEGRITY - Checksums

78 Security Features Features individually set as REQUIRED, PREFERRED, OPTIONAL, or NEVER Can set default and for each level ( READ, WRITE, etc) All default to OPTIONAL Leave NEGOTIATION at OPTIONAL
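
When two sides each declare a level for a feature, the levels have to be reconciled. The Python sketch below shows one plausible resolution rule, assuming that REQUIRED against NEVER fails, NEVER otherwise switches the feature off, and the feature is used once either side asks for it with REQUIRED or PREFERRED. The `negotiate` function is hypothetical, and the exact outcomes should be checked against the Condor manual.

```python
LEVELS = ("NEVER", "OPTIONAL", "PREFERRED", "REQUIRED")

def negotiate(client, daemon):
    """Return True if the feature is used, False if not; raise if the sides conflict."""
    for level in (client, daemon):
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
    pair = {client, daemon}
    if pair == {"REQUIRED", "NEVER"}:
        raise RuntimeError("negotiation fails: one side requires what the other forbids")
    if "NEVER" in pair:
        return False
    if "REQUIRED" in pair or "PREFERRED" in pair:
        return True
    return False  # OPTIONAL on both sides: nobody asked for the feature

print(negotiate("OPTIONAL", "OPTIONAL"))
print(negotiate("PREFERRED", "OPTIONAL"))
```

Under this rule the default of OPTIONAL everywhere means no feature is switched on until at least one side asks for it.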

79 Authentication Complexity Authentication comes at a price: complexity Authentication between machines requires an authentication system Condor supports several existing authentication systems –We don’t want to create yet another one

80 AUTHENTICATION_METHODS Authentication requires one or more methods: –FS –FS_REMOTE –GSI –Kerberos –NTSSPI –CLAIMTOBE

81 Networking

82 Networking Each submit node and potential execute node must –Be able to communicate with each other –Full bidirectional communication Firewalls are a problem –We can deal with that, see next slide NAT is more of an issue…

83 Networking Firewalls –Port 9618 needs to be open to your central manager, from all of your execute machines –Define range for dynamic ports HIGHPORT = LOWPORT = –And open corresponding ports in firewall –Condor can install its own exception in Windows firewall configuration

84 Virtualization

85 Condor’s VM Universe [Diagram labels: Submit Machine with Schedd; Execute Machine with Startd, VM, Startd, Job.]

86 Condor’s VM Universe Rather than submit a program into potentially unknown execution environments, why not submit the environment? The VM image is the job Job output is the modified VM image VMware and Xen are supported

87 Virtual Condor Appliance Engineering is Purdue’s largest non-central IT organization – 4000 machines –Already a BoilerGrid partner, providing nearly 1000 cores of Linux cluster nodes to BoilerGrid. But what about desktops? What about Windows? –Engineering is interested... But… Engineering leadership wants the ability to sandbox Condor away from systems holding research or business data. Can we do this?

88 Virtual Condor Appliance Sure! Distribute virtual machine images running a standard OS and Condor Configuration –CentOS 5.2 –Virtual private p2p networking –Encryption, authentication

89 Virtual Condor Appliance For us and partners on campus, this is a win –Machine owners get their sandbox –Our support load to bring new machine owners online gets easier –Execution environments become consistent Much of the support load with new “sites” is firewall and Condor permissions. –Virtual machines and virtual “IPOP” network make that all go away. Not only native installers for campus users, but now a VM image –With installer to run virtual nodes as a service –Systray app to forward keyboard/mouse events to virtual guests Not dependent on a particular virtualization implementation – we can prepare and distribute VM images with KVM, VirtualBox, VMware, Xen, and so on. –Just VMware currently –We’ll offer more in the future. Condor Week 2009

90 Whew!!!

91 I could also talk lots about… GCB: Living with firewalls & private networks Federated Grids/Clusters APIs and Portals MW Database Support (Quill) High Availability Fail-over Compute On-Demand (COD) Dynamic Pool Creation (“Glide-in”) Role-based prioritization and accounting Strong security, incl privilege separation Data movement scheduling in workflows …

92 Conclusion Campus Grids are effective ways of bringing high-performance computing to campus –Using institution’s existing investment in computing The Condor software is an excellent framework for implementing a Campus Grid –Flexible –Powerful –Minimal extra work for lab administrators! Just one more package in your image Virtualization with Condor –Improve security of machine owners’ systems –Improve grid manageability –Consistency

93 The End Questions? Interested in a campus grid at your institution? Want to join DiaGrid?

