BoilerGrid - Growth
How did we get from here… to here?
BoilerGrid - Rosen Center for Advanced Computing
Research Computing arm of ITaP - Information Technology at Purdue
Clusters in RCAC are arranged in larger "Community Clusters"
–One cluster, one configuration, many owners
–Leverages economies of scale for purchasing, and provides expertise in systems engineering, user support, and networking
BoilerGrid - Motivation
Early on, we recognized that the diverse owners of the community clusters don't use the machines at 100% capacity
–Community clusters used approximately 70% of capacity
–Condor installed on community clusters to cycle-scavenge from PBS, the primary scheduler
Goal: provide a general-purpose high-throughput computing resource on existing hardware
BoilerGrid - Challenges
In 2005, the Condor deployment at Purdue was unable to scale to the size of the clusters and ran on an old version of the software
An overhaul of the Condor infrastructure was needed!
BoilerGrid - Keep Condor Up to Date
Upgrading Condor
–In late 2005, we were running Condor version 6.6.5, which was 1.5 years old
–First, we needed to upgrade!
In a large, busy Condor grid, we found it's usually advantageous to run the development release of Condor
–Early access to new features and scalability improvements
BoilerGrid - Pool Design
Use many machines
–In 2005, we ran a single Condor pool of ~1800 machines, while the largest single Condor pools in existence at the time were ~1000 machines
–We implemented BoilerGrid as a flock of 4 pools, of up to 1200 machines each
–Implementing BoilerGrid today would look much different!
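Condor flocking of this kind is wired together in each pool's configuration files. A minimal sketch (all hostnames here are hypothetical placeholders, not our actual machines):

```
# On each pool's submit hosts: let the schedd flock jobs to the
# other pools' central managers when the local pool is full
FLOCK_TO = cm.pool2.example.edu, cm.pool3.example.edu, cm.pool4.example.edu

# On each pool's central manager and execute nodes: accept
# flocked jobs arriving from the other pools' submit hosts
FLOCK_FROM = submit.pool1.example.edu
ALLOW_WRITE = $(ALLOW_WRITE), $(FLOCK_FROM)
```

With this arrangement, jobs spill over to a remote pool only when they cannot be matched locally, which is what let four ~1200-machine pools behave as one resource.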
BoilerGrid - Submit Hosts
Many submit hosts
–In 2005, a single host ran the Condor schedd and could submit jobs
–Today, any RCAC machine used for user login, and in many cases end-user desktops, can submit Condor jobs
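From any submit host, a user describes a job in a submit description file and hands it to `condor_submit`. A minimal sketch (the executable and file names are hypothetical):

```
# my_job.sub - minimal vanilla-universe job description
universe   = vanilla
executable = my_analysis
arguments  = input.dat
output     = job.out
error      = job.err
log        = job.log
queue
```

Running `condor_submit my_job.sub` on any of the submit hosts places the job in that host's local queue, from which the schedd matches it against the pool.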
BoilerGrid - Challenges
Usage Tracking
–Tracking job-level accounting in a large Condor pool is difficult
–Job history resides on every submit host
–Recent versions of Condor's Quill software allow for a central database holding job (and machine) information
We are deploying this on BoilerGrid now
BoilerGrid - Storage
If your users expect to run jobs using a shared filesystem, a large Condor installation can overwhelm NFS servers
DAGMan and user logs on NFS can cause problems
–The defaults don't allow this for a reason!
Train users to rely less on the shared filesystem and take advantage of Condor's ability to transfer files
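Condor's file-transfer mechanism is enabled with a few lines in the submit description file, so jobs stage their inputs to the execute node's local disk instead of reading them over NFS. A sketch (the input file names are hypothetical):

```
# Stage inputs to the execute node and bring results back,
# instead of touching the shared filesystem from the job
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = input.dat, params.cfg
```

Output files created in the job's scratch directory are transferred back to the submit host when the job exits, so the NFS servers see none of the per-job I/O.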
BoilerGrid - Expansion
Successful use of Condor on our clusters led us to identify partners around campus
–Student computer labs operated by a sister unit in ITaP (2500 machines and growing)
–Library terminals (200 machines)
–Other campuses (500+ machines)
Management support is critical!
–Purdue's CIO supports using Condor on many machines run by ITaP, including the one on his own desk
BoilerGrid - Expansion
An even better route of expansion
–Condor users adding their own resources: machines in their own lab, or all the machines in their department
With distributed ownership comes new challenges
–Regular contact with owners' system administration staff
–Ensure that owners are able to set their own policies
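Owner policies are expressed through the standard Condor startd policy expressions in each machine's local configuration. A sketch of a desktop-friendly policy (the specific thresholds are illustrative, not our production values):

```
# Run jobs only after 15 minutes of keyboard/mouse idle time
START    = KeyboardIdle > (15 * $(MINUTE))

# Suspend the job the moment the owner comes back
SUSPEND  = KeyboardIdle < $(MINUTE)
CONTINUE = KeyboardIdle > (5 * $(MINUTE))

# Evict jobs that have been suspended for more than 10 minutes
PREEMPT  = (Activity == "Suspended") && \
           (CurrentTime - EnteredCurrentActivity > 600)
```

Because each owner edits only their own local configuration, departments can be as generous or as conservative as they like without involving the central pool administrators.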
BoilerGrid - Staffing
Implementing BoilerGrid required minimal staff effort
–Assumes an existing IT infrastructure that can operate many machines
–0.25 FTE ongoing to maintain Condor and coordinate with distributed Condor installations
With success comes more demand, and the end-user support to go along with it
–1.5 FTE science support consultants assist with porting codes and training users to use Condor effectively
BoilerGrid - Future Work
TeraGrid (NSF HPCOPS) - portal for submission and monitoring of Condor jobs
Centralized Quill database for job and machine state
–Excellent source of data for future research in distributed systems