Scaling Puppet and Foreman for HPC


1 Scaling Puppet and Foreman for HPC
Trey Dockendorf, HPC Systems Engineer, Ohio Supercomputer Center

2 Introduction
Puppet – configuration management
Hiera – YAML data for Puppet
Foreman – provisioning
HPC environment using NFS root
Deployed to 1000 HPC and infrastructure systems
First used on Owens, an 824-node cluster

3 Motivation
A requirement of any large HPC center is scaling the provisioning and management of HPC clusters
Common provisioning and configuration management between compute and infrastructure
Unified management of PXE, DHCP, and DNS
Audit network interfaces
Support testing configuration changes – unit and system

4 Foreman
Host life cycle management
DNS, DHCP, TFTP – both infrastructure and HPC
Tuning: PassengerMaxPoolSize – prevents overloading the Foreman host
NFS root support required custom provisioning templates
Local boot PXE template overridden so hosts always network boot
Workflow change for HPC – no "Build" mode; key-value parameters used instead

5 Foreman Key-Value Storage
Key-value pairs stored in Foreman as Parameters
Change behavior during boot: nfsroot_build
Change TFTP templates: nfsroot_path, nfsroot_host, nfsroot_kernel_version
Hierarchical storage provides inheritance: base/owens -> base/owens/compute -> FQDN
Managed using the web UI and via the API with scripts: host-parameter.py & hostgroup-parameter.py (see the sketch below)
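
For illustration, a minimal sketch of what a host-parameter.py-style helper could look like against the Foreman v2 REST API. The URL, credentials, and host name are placeholders; the actual scripts live in the repositories listed on the last slide:

#!/usr/bin/env python
# Hypothetical host-parameter.py-style helper using the Foreman v2 REST API.
import requests

FOREMAN_URL = "https://foreman.example.com"  # placeholder Foreman host
AUTH = ("admin", "changeme")                 # placeholder API credentials

def set_host_parameter(host, name, value):
    """Create or update a key-value Parameter on a Foreman host."""
    base = "%s/api/hosts/%s/parameters" % (FOREMAN_URL, host)
    existing = requests.get(base, auth=AUTH).json().get("results", [])
    payload = {"parameter": {"name": name, "value": value}}
    match = [p for p in existing if p["name"] == name]
    if match:
        resp = requests.put("%s/%s" % (base, match[0]["id"]), json=payload, auth=AUTH)
    else:
        resp = requests.post(base, json=payload, auth=AUTH)
    resp.raise_for_status()

# Example: flag a node for a read-only root rebuild at next boot
set_host_parameter("c0001.example.com", "nfsroot_build", "true")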

6 Foreman NFS Root Provisioning
Provisioning handled by a read/write host
Foreman support for NFS root was written from scratch
Read-only hosts have specific writable locations defined by /etc/statetab & /etc/rwtab
statetab – persists through reboots; used only when absolutely necessary, so a reboot can reset a system
rwtab – does not persist through reboots; all Puppet-managed resources go in rwtab
Read-only rebuild: nfsroot_build parameter (set by host-parameter.py) and the osc-partition service (see the sketch below)
A Foreman role allows students to rebuild nodes
osc-partition checks the Foreman key-value parameters
Partition scripts generated by Puppet; a partition schema in Hiera defines their contents
partition-wait script runs if nfsroot_build=false, waiting for LVM to become available
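
Purely to illustrate the decision flow described above, a hypothetical sketch of an osc-partition-style check. The script paths are invented; the real service and the Puppet-generated partition scripts are in the linked repositories:

#!/usr/bin/env python
# Illustrative osc-partition-style decision flow; script paths are invented.
import socket
import subprocess
import requests

FOREMAN_URL = "https://foreman.example.com"  # placeholder
AUTH = ("admin", "changeme")                 # placeholder

def get_host_parameter(host, name):
    """Read one key-value Parameter for a host from Foreman."""
    url = "%s/api/hosts/%s/parameters" % (FOREMAN_URL, host)
    for param in requests.get(url, auth=AUTH).json().get("results", []):
        if param["name"] == name:
            return param["value"]
    return None

fqdn = socket.getfqdn()
if get_host_parameter(fqdn, "nfsroot_build") == "true":
    # Rebuild local disks with the Puppet-generated partition scripts
    subprocess.check_call(["/usr/local/sbin/partition.sh"])       # hypothetical path
else:
    # nfsroot_build=false: just wait for existing LVM volumes to appear
    subprocess.check_call(["/usr/local/sbin/partition-wait.sh"])  # hypothetical path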

7 NFS Root Boot Workflow

8 Parameter Manipulation Timing
Operation                   Avg Time (sec)   StdDev (sec)
host get                    0.522            0.021
host list                   0.467            0.025
host set                    0.581            0.024
host set w/ TFTP sync       1.321            0.057
host delete                 0.433            0.026
host delete w/ TFTP sync    1.148            0.038
hostgroup get               0.510
hostgroup list              0.514            0.034
hostgroup set               0.526            0.030
hostgroup delete            0.489

9 Scaling Puppet – Standalone Hosts
Typically a master compiles catalogs for agents
Scaling achieved by load balancing between masters
Subject Alternative Name certificates – any master can act as the CA
Masters synced with MCollective and r10k
Environments isolated by the r10k control repo and git branching; modules not in the control repo are defined by a Puppetfile, mostly community modules
Foreman acts as ENC (External Node Classifier) for Puppet, supplying important data such as IP address and hostgroup (see the sketch below)
The number of CPUs available to the masters bounds the number of concurrent agents
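
Foreman ships its own ENC script (node.rb), but the ENC contract itself is simple: Puppet invokes an executable with a node name and reads YAML from stdout. A minimal Python sketch of that contract, with placeholder URL and credentials:

#!/usr/bin/env python
# Minimal illustration of the ENC contract; Foreman's stock ENC is node.rb.
import sys
import requests

FOREMAN_URL = "https://foreman.example.com"  # placeholder
AUTH = ("admin", "changeme")                 # placeholder

node = sys.argv[1]  # Puppet passes the node name as the first argument
# Foreman serves per-node classification (classes, environment, parameters) as YAML
resp = requests.get("%s/node/%s?format=yml" % (FOREMAN_URL, node), auth=AUTH)
resp.raise_for_status()
sys.stdout.write(resp.text)

Puppet is pointed at such a script with the node_terminus = exec and external_nodes settings in puppet.conf.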

10 Puppet Performance – Standalone Hosts
                      Min   Max    Mean   Std
Resource Count        644   9093   1313   806
Compile Times (sec)   14    82     21     9

11 Scaling Puppet – HPC Systems
Scaling achieved by removing the master and running masterless
Primary bottleneck is the performance of reading manifests and modules, then compiling locally
/opt/puppet – manifests and modules synced by MCollective and r10k
Read-write hosts still use Puppet masters
Masterless Puppet runs via papply (see the sketch below)
Environment isolation defaults to the read-write host's environment
Stateful hosts – use PuppetDB and the Foreman ENC
Stateless hosts – use a minimal catalog run in two stages at boot and manage writable locations in rwtab
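
A rough sketch of what a papply-style wrapper amounts to, assuming one synced tree per environment under /opt/puppet (an assumption for illustration; the real papply module is linked on the last slide):

#!/usr/bin/env python
# Rough papply-style wrapper: masterless `puppet apply` over synced content.
import os
import subprocess
import sys

env = sys.argv[1] if len(sys.argv) > 1 else "production"
base = os.path.join("/opt/puppet", env)  # assumption: one tree per environment

subprocess.check_call([
    "puppet", "apply",
    "--modulepath", os.path.join(base, "modules"),
    "--hiera_config", os.path.join(base, "hiera.yaml"),
    os.path.join(base, "manifests", "site.pp"),
])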

12 NFS Root Boot Workflow

13 Puppet Performance - Masterless
          Min (sec)   Max (sec)   Mean (sec)   Std (sec)
Early     27          257         115          57
Late      87          414         185          73
Compile   3           19          9

Late contains 71 managed resources
Late contains a 60-second wait for filesystems, so GPFS is mounted before pbs_mom and NHC run
Times collected from system bring-up after maintenance

14 Cluster Definitions - YAML
YAML files define the cluster
A script syncs the YAML with Foreman
Loaded into Puppet as Hiera data to make Puppet aware of cluster nodes and their properties
Populates clustershell, pdsh, Torque, conman, powerman, SSH host-based auth, etc.
YAML deployed to the root filesystem as a Python pickle and a Ruby marshal dump
YAML on the root filesystem is loaded to populate Ruby- and Python-based facts (see the sketch below)
Informational facts, such as node location
Facts that determine behavior when Puppet runs
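
As a sketch of the Python side, Facter can pick up external facts from executables that print key=value lines. Assuming a hypothetical pickle path and key names, a fact script over the deployed cluster data might look like:

#!/usr/bin/env python
# External fact sketch: Facter runs executables in its facts.d directory
# and parses key=value lines from stdout. Path and keys are hypothetical.
import pickle
import socket

with open("/etc/facts/cluster.pickle", "rb") as f:  # hypothetical path
    cluster = pickle.load(f)

node = cluster.get(socket.getfqdn(), {})
# An informational fact (node location) and a behavior-determining fact
print("node_location=%s" % node.get("location", "unknown"))
print("node_role=%s" % node.get("role", "unknown"))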

15 Custom Fact Example Usage

16 Repositories
Foreman Templates
Foreman Plugin
NFS Root Module
Puppet Masterless Module
Cluster Facts Module

