CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it. The CERN Agile Infrastructure Project: Configuration and Operations Tools. Helge Meinhard.

Presentation transcript:

The CERN Agile Infrastructure Project: Configuration and Operations Tools
Helge Meinhard / CERN-IT (replacing Manuel Guijarro)
HEPiX Spring April 2012, Praha
CERN IT Department, CH-1211 Genève 23, Switzerland

Configuration and Operations Tools Agile Infrastructure - Configuration and Operation Tools

Project Scope
- The project is reviewing the entire CERN computer-centre management toolset
  – What happens from the bare metal up
  – Asset management, inventory
  – Sysadmin tools and maintenance workflows
  – Service management and configuration tools
  – Dynamic configuration for ‘virtual’ hosts
  – Operations monitoring
  – Workflow automation and continuous deployment
  – …

Why?
- Current production system built around the Quattor toolset is successfully managing O(10k) servers
  – (CERN) Quattor + many CERN components
- Why are we changing the toolset?

What are the Issues (1)
- Uncompressible technical debt
  – The cost to develop and maintain our own solution is not reducing and clearly exceeds our resources
  – Small community (less funding) and general support problem. At CERN, we’ve fallen into the “sticky hands” support model
- We need better automation and integration between the sub-components
  – Lack of automated workflow: everything is a ticket
    - Script™: your added value in the process is often your CERN password
  – The 15-min “CDB commit walk” – context switch cost

What are the Issues (2)
- Transferrable skills and training
  – Learning curve for our tools is steep and remains high
  – It’s easier to hire people who have skills in a widely-used tool than your internal tools
    - Depending on where you look

Jobs Adverts – indeed.com
Index of millions of worldwide job posts across thousands of job sites.
These are the sort of posts our departing staff will be applying for.
[Chart comparing job adverts mentioning Puppet and Quattor]

Integration is Hard
- IPv6, virtualisation, Windows Server all need a solution
  – We could leverage lots of open source tools
    - But piecemeal integration of these requires high investment due to our complex system
    - Years of organic growth have made the system way too ‘hairy’
    - It’s often easier to reinvent rather than integrate
  – Lack of ‘dynamic-ness’ in the infrastructure
    - We hack the config system for dynamic VMs
- It’s critical to look at the system as a whole

Where to Look?
- Large ops community out there taking the “tool chain” approach whose scaling needs match ours: O(100k) servers, many apps
- Become standard and join this community

Use Puppet for the Core
- The tool space has exploded in the last few years
  – In configuration management and ops
  – Large, shared ‘tool forges’, and lots of experience
- Puppet and Chef are the clear leaders for the ‘core’ tool
  – Other tools in our ‘scope’ try to integrate with those
- Many large-scale enterprises use Puppet
  – Its declarative approach fits better with what we are used to
  – Large installations: friendly, wide-base community and commercial support and training
  – You can buy books on it

Scaling Challenges: Nodes
- Currently we have O(10k) physical nodes
- IaaS approach:
  – Moving to virtual machines
  – More (smaller, load-balanced) service nodes
  – VMs for raw compute (batch or pilot jobs)
  – Homogeneous: compute + storage on the same node
- Add another computer centre, 24/48 SMT cores per node, you get 100k – 300k virtual nodes to be managed
  – 99.6% (1) node update success-rate means 1200 manual interventions to “fix it”
(1) in a recent intervention on lxbatch
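
The arithmetic behind the last bullet can be checked with a short sketch. The function name `expected_failures` is illustrative only, and applying the 99.6% rate observed on lxbatch uniformly to every virtual node is an assumption made for the estimate.

```python
from fractions import Fraction


def expected_failures(nodes: int, success_rate: str) -> int:
    """Return the expected number of nodes whose update fails.

    success_rate is a decimal string such as "0.996"; Fraction keeps the
    arithmetic exact, so no floating-point rounding affects the result.
    """
    failure_rate = 1 - Fraction(success_rate)
    return int(nodes * failure_rate)


if __name__ == "__main__":
    # Node counts taken from the slide: today's O(10k) and the projected range.
    for nodes in (10_000, 100_000, 300_000):
        print(nodes, expected_failures(nodes, "0.996"))
```

At the upper estimate of 300k virtual nodes this reproduces the 1200 manual interventions quoted above; at the lower estimate of 100k it would be 400.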

Scaling Challenges: People
- Many, diverse applications (“clusters”) managed by different teams
- …and 700+ other “unmanaged” Linux nodes in VMs that could benefit from a simple configuration system

Agile Infrastructure 1st Try (1)
- First started investigating tools in September 2011 using ‘part-time’ resources from several IT groups
  – Trying iterative “agile-sprint” style (Scrum): short sprints, feedback, sprint review, visible
  – Take first, best-guess at architecture and tool selection, iterate
- Mixed success with this agile style
  – What works: Good visibility and reviews. Daily “scrum” meeting useful. Weekly review meeting open to management.
  – What doesn’t: The “time boxing” part of Scrum sprints is hard with part-time resources
  – Now more staff available, but still mostly part-time efforts

Agile Infrastructure 1st Try (2)
- We’re currently running:
  – OpenStack as cloud software for virtual machines, image management, bulk storage
    - See later presentation
  – Puppet for the configuration management core
  – …with Foreman as a dashboard

Foreman Dashboard

Agile Infrastructure 1st Try (2), continued
- None of the tools are “perfect” out-of-the-box
  – …but we’d rather submit patches to a good open source tool than re-implement it
  – We’ve experienced very good community support: RFCs and patches are quickly accepted
  – Very active community: often problems are fixed and missing features implemented before you even report them

Agile Infrastructure 1st Try (3)
- We’re currently running:
  – yum for software distribution (replacing spma)
  – git for template management: why git?
    - Almost all the Puppet (and Chef) usage schemes out there assume you use git to handle the templates
    - Many of the tools we can benefit from also assume git
    - We should not be different from the rest of the community

Puppet
- Client/server architecture
  – “puppetmaster”: horizontally scalable Rails application
  – X509 cert authenticated nodes: integrate with CERN CA

Puppet
- Puppet runs on the client, applying the configuration changes
  – It detects the current state and only runs if there’s something to do
- It runs every few minutes
  – New configuration will be ~immediately applied (“fail-fast”)
  – This is a change from CDB where ‘latent’ changes can be stacked up
- Normal mode is client-side compile (“assume success”)
  – No more CDB commit waits
  – Change from CDB: the compilation fails later
- Good monitoring is a pre-req: puppet sends reports back to the puppetmaster
  – The Foreman tool can collect these for you
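
The "detect the current state and only act if there is something to do" behaviour, together with the per-run report, can be illustrated with a toy model. This is a minimal sketch and not how Puppet is implemented: `ensure_file` and `run_once` are hypothetical names, and a single managed file stands in for the full set of resources on a node.

```python
import os
import tempfile


def ensure_file(path: str, content: str) -> bool:
    """Converge one file to the desired content.

    Returns True if a change was made, False if the file already matched.
    """
    try:
        with open(path, encoding="utf-8") as fh:
            if fh.read() == content:
                return False  # current state already matches: nothing to do
    except FileNotFoundError:
        pass  # file is missing, so it has to be created
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(content)
    return True


def run_once(desired: dict) -> dict:
    """Apply every desired file and return a per-resource change report."""
    return {path: ensure_file(path, content) for path, content in desired.items()}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        desired = {os.path.join(tmp, "motd"): "managed by the config tool\n"}
        print(list(run_once(desired).values()))  # first run changes the file
        print(list(run_once(desired).values()))  # second run finds nothing to do
```

Because each run only reports what it actually changed, running frequently is cheap, and the report is the natural thing to forward to a dashboard.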

Puppet Language
- Puppet uses its own Ruby-like language for the templates to “assert” the desired state of the nodes
  – With Ruby fall-back for hard stuff (we’ve only needed this once)
- Being declarative rather than procedural, there are quirks
  – Takes a bit of practice to ‘get it’
  – There are books, online docs, online cook-books, and a large community to help
- It dispenses with the need for ncm components
  – All the work is done by puppet on the node itself – you just provide the template part to assert what you want done
  – Less software -> easier to move to new OS versions

Externals
- Puppet uses an external DB for much of the configuration that we currently store in textual CDB templates
- Node function + hardware
  – Moving a host between clusters is a DB update
- Your configuration can use variables the node detects itself
  – e.g. reconfigure daemons based on where a newly live-migrated VM has found itself
- Query the compiled configuration of other hosts
  – e.g. Open my firewall to the lxadm nodes
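
The three ideas on this slide can be sketched in a few lines. Everything below is hypothetical: `NODE_DB` and `CLUSTER_CONFIG` stand in for the external database, the host names and the `location` fact are invented, and looking hosts up in the node table is a simplification of querying other hosts' compiled configuration.

```python
# Hypothetical stand-in for the external node database: hostname -> cluster.
NODE_DB = {"node001": "lxbatch", "node002": "lxadm"}

# Hypothetical per-cluster settings that would otherwise live in text templates.
CLUSTER_CONFIG = {
    "lxbatch": {"role": "batch worker"},
    "lxadm": {"role": "admin node"},
}


def compile_config(hostname: str, facts: dict) -> dict:
    """Combine the node's cluster settings with values the node reports itself."""
    cluster = NODE_DB[hostname]
    config = {"cluster": cluster, **CLUSTER_CONFIG[cluster]}
    # A node-detected value (an invented 'location' fact) selects a setting,
    # e.g. after a VM has been live-migrated somewhere else.
    config["log_server"] = "log-" + facts["location"]
    return config


def hosts_in_cluster(cluster: str) -> list:
    """List all hosts recorded for a cluster, e.g. to open a firewall to them."""
    return sorted(host for host, c in NODE_DB.items() if c == cluster)


if __name__ == "__main__":
    print(compile_config("node001", {"location": "site-a"}))
    NODE_DB["node001"] = "lxadm"  # moving a host between clusters is a DB update
    print(compile_config("node001", {"location": "site-b"}))
    print(hosts_in_cluster("lxadm"))
```

The point of the design is that no template file is edited when a host changes cluster or location; only the database row or the reported fact changes.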

Moving towards PaaS
- Parametrisable recipes
  – Just fill in the blanks
- The aim is to make it easy to use “pre-canned” recipes without even touching a Puppet template
  – e.g. stick a standard CERN SSO-enabled apache / mod_wsgi / Django server on my box
  – …with these parameters
- Moving us in the PaaS direction
  – Ultimately, it would be better if you never even needed to log into this node
    - (J2EE public service, IT web hosting service, MySQL service)
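
A "fill in the blanks" recipe can be pictured as a function with defaults and a small set of allowed parameters. This is a sketch under assumptions: `web_server_recipe`, its parameter names and its default values are all invented for illustration and do not describe an actual CERN recipe. In Puppet itself, parameterised classes play this role.

```python
# Hypothetical defaults for a "pre-canned" web-server recipe.
WEB_SERVER_DEFAULTS = {"port": 443, "sso": True, "wsgi_app": None}


def web_server_recipe(**params) -> dict:
    """Fill in the blanks of a canned recipe and return the resulting settings."""
    unknown = set(params) - set(WEB_SERVER_DEFAULTS)
    if unknown:
        # Reject typos early instead of silently ignoring them.
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    settings = {**WEB_SERVER_DEFAULTS, **params}
    if settings["wsgi_app"] is None:
        raise ValueError("wsgi_app must be provided")
    return settings


if __name__ == "__main__":
    # The user supplies only what differs from the standard setup.
    print(web_server_recipe(wsgi_app="mysite.wsgi"))
```

The user never touches the recipe body; only the parameters vary between services, which is what makes the approach a step towards PaaS.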

Standard Workflow
[Diagram comparing three iterative workflows; steps listed as they appear on the slide]
- CDB on lxadm: check out from CDB, update templates, CDB commit, run and check on test node, notify with nc-client (n minutes), iterate
- Puppet on lxadm: check out from git, update templates, git commit and push, run and check on test node, notify with mcollective (1 minute), iterate
- Puppet-apply on test node: check out from git on the test node, update templates, run puppet-apply, check on test node, notify with mcollective, iterate
- Further steps shown for the Puppet workflows: check on foreman, check on node(s), check on foreman, git commit and push

Modernising our Processes (1)
- Our software processes for the computer centre are fairly limited
  – fire-and-forget broadcasts to project-elfms
- …and rather manual
  – The manual test/ -> preprod/ -> prod/ template dance
  – Our toolset RPMs are ‘built on laptop’ and uploaded to ‘swrep’ by hand
- Add standard continuous integration (e.g. Jenkins, Bamboo, Cruise) and automated build (Koji) as the only route to get new packages into the CC
  – …then automate the testing
  – e.g. suitably tagged RPMs are automatically deployed to /test nodes

Modernising our Processes (2)
- We’re working out which of the many puppet / git models suits us
  – Code review, sign-off and automated notification for changes that will affect multiple clusters
  – How to automate the test/preprod/prod advancement
- Pre-req is flexible monitoring and alarming
  – You need to trust that an automation failure will be signaled to you
    - Script-generated e-mails are banned
  – Need good monitoring to hang these notifications on
- Integrate components rather than use Script™
  – Script-generated tickets (where your value in the process is your password) are banned

Current Tool Snapshot (Liable to Change)
[Diagram of the current tool chain. Components shown: Jenkins; Koji, Mock; Puppet; Foreman; AIMS/PXE; Yum repo, Pulp; Puppet stored config DB; mcollective, yum; JIRA; Lemon; git, SVN; OpenStack Nova; hardware database]

Preliminary Timelines
Year | What | Actions
2011 | | Agree overall principles
2012 | | Prepare formal project plan; Establish IaaS in CERN CC; Production Agile Infrastructure; Monitoring Implementation as per WG; Migrate lxcloud; Early adopters to Agile Infrastructure
2013 | LSD 1; New Data Centre | Extend IaaS to remote CC; Business Continuity; Support Experiment App re-work; Migrate CVI; General migration to Agile with SLC6 and Windows
 | LSD 1 (to November) | Phase out Quattor/CDB/…
Aggressive schedule if we are to make it for new data centre

Initial Steps
- Decided on tools
- Integrating them to make a production setup
  – We can still change… But we’re starting to commit…
- Looking for early adopters
  – In particular to understand the people-scaling / ACL issues: which of the git/puppet models is best?
    - e.g. PES/OIS services: batch/VMs, JIRA, Drupal
  – Help with integration / coding
  – Help with ideas
  – Help with building the task list

Summary
- IT has started a new project to move our infrastructure to a new toolset based around industry standard open source components
  – Puppet for the core configuration tool
  – Better integration between components
  – Use of more modern software processes to aid deployment
  – Better monitoring
  – Engage with the community rather than re-implement
- Overall project scope is wider (see following presentations)
  – Improved monitoring
  – Cloud and virtualisation
- Actively seeking wide involvement from CERN-IT and feedback from the community

Acknowledgements
Many colleagues at CERN-IT, including
  – Tim Bell
  – Ian Bird
  – Bernd Panzer-Steindel
  – Gavin McCance
  – Manuel Guijarro