PCGRID ‘08 Workshop, Miami, FL April 18, 2008 Preston Smith Implementing an Industrial-Strength Academic Cyberinfrastructure at Purdue University.


1 PCGRID ‘08 Workshop, Miami, FL April 18, 2008 Preston Smith Implementing an Industrial-Strength Academic Cyberinfrastructure at Purdue University

2 BoilerGrid Introduction –Environment –Motivation Challenges –Infrastructure –Usage Tracking –Storage –Staffing Future Work Results

3 BoilerGrid - Growth How did we get from here… to here?

4 BoilerGrid - Rosen Center for Advanced Computing Research Computing arm of ITaP - Information Technology at Purdue Clusters in RCAC are arranged in larger “Community Clusters” –One cluster, one configuration, many owners –Leverages economies of scale for purchasing, and provides expertise in systems engineering, user support, and networking

5 BoilerGrid - Motivation Early on, we recognized that the diverse owners of the community clusters don’t use the machines at 100% capacity –Community clusters used approximately 70% of capacity –Condor installed on community clusters to cycle-scavenge from PBS, the primary scheduler Goal: provide a general-purpose high-throughput computing resource on existing hardware

6 BoilerGrid - Challenges In 2005, the Condor deployment at Purdue was unable to scale to the size of the clusters, and ran on an old version of the software An overhaul of the Condor infrastructure was needed!

7 BoilerGrid - Keep Condor Up-to-date Upgrading Condor –In late 2005, we were running Condor version 6.6.5, which was 1.5 years old. –First, we needed to upgrade! In a large, busy Condor grid, we found it’s usually advantageous to run the development release of Condor –Early access to new features, scalability improvements

8 BoilerGrid - Pool Design Use many machines –In 2005, we ran a single Condor pool with ~1800 machines. In 2005, the largest single Condor pools in existence were ~1000 machines. –We implemented BoilerGrid as a flock of 4 pools, of up to 1200 machines each. –Implementing BoilerGrid today? Would have looked much different!
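The sizing arithmetic behind a flock can be sketched as follows. This is only an illustration: the function names are hypothetical, the 1200-machine figure is treated as a hard per-pool cap, and the 4000-machine total is the 2005 pool size reported on the Results slide. It is not a description of how the pools were actually laid out at Purdue.

```python
def pools_needed(total_machines: int, max_per_pool: int) -> int:
    """Smallest number of pools that keeps every pool at or under the cap."""
    if total_machines < 0 or max_per_pool <= 0:
        raise ValueError("machine count must be >= 0 and cap must be > 0")
    # Integer ceiling division, avoiding floating-point rounding.
    return (total_machines + max_per_pool - 1) // max_per_pool


def spread_evenly(total_machines: int, pools: int) -> list[int]:
    """Divide machines across pools so that sizes differ by at most one."""
    if pools <= 0:
        raise ValueError("need at least one pool")
    base, extra = divmod(total_machines, pools)
    return [base + 1] * extra + [base] * (pools - extra)


if __name__ == "__main__":
    n = pools_needed(4000, 1200)      # assumed total and cap, see lead-in
    print(n, spread_evenly(4000, n))
```

With those assumed inputs, the cap forces four pools, which matches the flock of 4 described on the slide.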

9 BoilerGrid - Submit Hosts Many submit hosts –In 2005, a single host ran the Condor schedd and could submit jobs –Today, any RCAC machine that allows user login, and in many cases end-user desktops, can submit Condor jobs

10 BoilerGrid - Challenges Usage Tracking –Tracking job-level accounting with a large Condor pool is difficult –Job history resides on every submit host –Recent versions of Condor’s Quill software allow for a central database holding job (and machine) information Deploying this on BoilerGrid now
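To show why job history scattered across submit hosts is awkward to account for, here is a minimal sketch of merging per-host records into per-user totals. The JobRecord fields, host names, and sample values are all hypothetical, and the code does not use Quill or read real Condor history files; a central database would make this merge step unnecessary.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class JobRecord(NamedTuple):
    """Hypothetical, simplified view of one completed job."""
    submit_host: str
    owner: str
    wall_hours: float


def usage_by_user(histories: Iterable[Iterable[JobRecord]]) -> dict:
    """Merge the job histories of many submit hosts into per-user totals."""
    totals = defaultdict(lambda: {"jobs": 0, "hours": 0.0})
    for host_history in histories:      # one iterable per submit host
        for rec in host_history:
            totals[rec.owner]["jobs"] += 1
            totals[rec.owner]["hours"] += rec.wall_hours
    return dict(totals)


if __name__ == "__main__":
    # Invented sample data for two submit hosts.
    host_a = [JobRecord("hostA", "alice", 1.5), JobRecord("hostA", "bob", 4.0)]
    host_b = [JobRecord("hostB", "alice", 2.0)]
    print(usage_by_user([host_a, host_b]))
```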

11 BoilerGrid - Storage If your users expect to run jobs using a shared filesystem, a large Condor installation can overwhelm NFS servers. DAGMan and user logs on NFS can cause problems –The defaults don’t allow this for a reason! Train users to rely less on the shared filesystem and take advantage of Condor’s ability to transfer files
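One way to move users off the shared filesystem is to have their submit description files request Condor's file transfer. The sketch below generates such a file as text. The helper name, executable, input file names, and log path are hypothetical; the submit commands it emits (should_transfer_files, when_to_transfer_output, transfer_input_files, log) are standard Condor submit commands, but site policy may require other settings.

```python
def submit_description(executable: str, input_files: list[str],
                       log_path: str) -> str:
    """Build a vanilla-universe submit description that uses file transfer
    instead of assuming a shared filesystem."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
    ]
    if input_files:
        lines.append("transfer_input_files = " + ", ".join(input_files))
    # The user log is pointed at a path chosen by the caller; a local
    # disk path (assumed here) keeps it off NFS.
    lines.append(f"log = {log_path}")
    lines.append("queue")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All names below are placeholders for illustration.
    print(submit_description("simulate", ["params.dat", "mesh.in"],
                             "/tmp/simulate.log"))
```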

12 BoilerGrid - Expansion Successful use of Condor in clusters led us to identify partners around campus –Student computer labs operated by sister unit in ITaP (2500 machines and growing) –Library terminals (200 machines) –Other campuses (500+ machines) Management support is critical! –Purdue’s CIO supports using Condor on many machines run by ITaP, including the one on his own desk

13 BoilerGrid - Expansion An even better route of expansion –Condor users adding their own resources Machines in their own lab All the machines in their department With distributed ownership come new challenges –Regular contact with owners’ system administration staff –Ensure that owners are able to set their own policies

14 BoilerGrid - Staffing Implementing BoilerGrid required minimal staff effort –Assuming an existing IT infrastructure that can operate many machines –0.25 FTE ongoing to maintain Condor and coordinate with distributed Condor installations With success comes more demand, and the end-user support to go along with it –1.5 science support consultants assist with porting codes and training users to use Condor effectively

15 BoilerGrid - Future Work TeraGrid (NSF HPCOPS) - Portal for submission and monitoring of Condor jobs Centralized Quill database for job and machine state –Excellent source of data for future research in distributed systems

16 BoilerGrid - Results
Year   Pool Size   Jobs        Hours Delivered   Unique Users
2004   1500        43,551      346,000           14
2005   4000        210,717     1,695,000         26
2006   6100        4,251,981   5,527,000         72
2007   7700        9,611,813   9,524,000         117
2008   13000+      ?           ?                 63 so far
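A few derived figures make the trend in these results easier to read. The sketch below uses the full-year numbers from the slide (2008 is omitted because its job and hour counts are not given); the metric names are my own, and "hours per job" is simply hours delivered divided by jobs, not a statistic reported in the talk.

```python
# year: (pool_size, jobs, hours_delivered, unique_users), from the slide
RESULTS = {
    2004: (1500, 43_551, 346_000, 14),
    2005: (4000, 210_717, 1_695_000, 26),
    2006: (6100, 4_251_981, 5_527_000, 72),
    2007: (7700, 9_611_813, 9_524_000, 117),
}


def hours_per_job(year: int) -> float:
    """Average hours delivered per job in the given year."""
    _, jobs, hours, _ = RESULTS[year]
    return hours / jobs


def growth_factor(column: int, start: int, end: int) -> float:
    """Ratio of a column's value in the end year to its value in the start year."""
    return RESULTS[end][column] / RESULTS[start][column]


if __name__ == "__main__":
    for year in sorted(RESULTS):
        print(year, round(hours_per_job(year), 2))
    print("jobs growth 2004 to 2007:", round(growth_factor(1, 2004, 2007)))
```

Under this measure the average job shrinks from roughly eight hours in 2004 to about one hour in 2007, while the number of jobs grows by more than two orders of magnitude.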

17 BoilerGrid - Results

18 BoilerGrid - Conclusions Condor is a powerful tool for getting real science done on otherwise unused hardware http://www.rcac.purdue.edu/boilergrid Questions?

