13,000 Jobs and counting…

Advertising and Data Platform Our System

Our Team
- We provide Jenkins infrastructure as a service and develop tools related to Continuous Delivery.
- Product teams own and manage their CD pipelines: they configure jobs, etc.
- We don't control what is in a job. It is a shared resource, and we trust our engineers to be smart.
- There is enough monitoring to check the health of the infrastructure.
- Teams rely on this infrastructure for their deployments and expect it to be up.

Jenkins Infrastructure At A Glance:
- 1 primary Jenkins master and 3 backup masters in 2 data centers
- 50 Jenkins slaves in 3 data centers
- 400+ executors
- Hardware configuration:
  - 2 x Xeon E GHz, 4.80GT QPI (HT enabled, 12 cores, 24 threads)
  - 96G memory
  - 1.2TB disk
- Supports RHEL, FreeBSD and Mac builds
- 20TB filer volume to store Jenkins job and build data

Key Metrics At A Glance:
- 13,000+ jobs
- 8,000+ builds per day
- 2M+ builds per year
- 6TB build data
- Average build status: 80% success, 20% failure

YOY – Number of Builds

Physical Architecture
[Diagram. Labels:]
- CNAME DNS Rotation
- DC1 and DC2, each with Filer Storage, a Jenkins Master Primary Server and a Jenkins Master Secondary Server
- Jenkins Slaves: 25 RHEL, FreeBSD and Mac Slaves
- Snap Mirror Replication between DC1 and DC2 Filer
- MySQL Database
- Jenkins Dashboard Crawler
- Jenkins Data

Issues and Solution: Multiple Build Environments
Issues
- We can't scale if we run only one build on a slave.
- Multiple builds running at the same time conflict with each other.
Solution
- Use lightweight containers.
- In our case, we use a heavily augmented version of the standard UNIX command chroot.
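The idea of giving each build its own chroot can be illustrated with a minimal sketch. It is not the augmented chroot tool described on the slide: the base directory `/var/chroots`, the naming scheme, and both function names are assumptions made for illustration. The code only constructs the command line. Actually running `chroot` requires root privileges and a populated root directory.

```python
import shlex
from pathlib import PurePosixPath
from typing import List

# Hypothetical base directory for per-build chroots (not from the slides).
CHROOT_BASE = PurePosixPath("/var/chroots")


def chroot_dir(job_name: str, build_number: int) -> PurePosixPath:
    """Return a chroot root unique to one build, so concurrent builds never share one."""
    safe_job = job_name.replace("/", "_")
    return CHROOT_BASE / f"{safe_job}-{build_number}"


def chroot_command(job_name: str, build_number: int, build_script: str) -> List[str]:
    """Build the argv that would run build_script inside that build's chroot."""
    root = chroot_dir(job_name, build_number)
    return ["chroot", str(root), "/bin/sh", "-c", build_script]


# Example: show the command for one (hypothetical) build.
print(shlex.join(chroot_command("ads/platform", 42, "make test")))
```

Because the root directory depends on both the job name and the build number, two builds on the same slave get separate file system views and cannot overwrite each other's files.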

Issues and Solution: JVM
Issues
- Jenkins loads the configuration of jobs and their history into memory when it starts up.
- JVM performance conundrum.
Solution
- Increased the memory on the master.
- Allotted JVM heap: 48GB
- JVM heap used: min 5GB, avg 10GB, max 15.5GB

Issues and Solution: High Availability
Issues
- We lose data when the Jenkins master crashes.
- Even if a backup exists, it takes many hours to set up a new master from the backup.
Solution
- Moved Jenkins configuration and data to a filer, with a mirror.
- This allowed us to switch to the backup / Disaster Recovery (DR) Jenkins master in seconds.
- 4 masters behind DNS rotation
- 2 masters in each of the Prod and DR colos
- 99% uptime for the master
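The failover decision itself (use the primary if it is up, otherwise fall back to the next master) can be sketched as below. This illustrates the selection logic only, not the DNS rotation mechanism used in the deck. The host names and the health check are hypothetical.

```python
from typing import Callable, Optional, Sequence


def pick_master(masters: Sequence[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy master in priority order, or None if all are down."""
    for host in masters:
        if is_healthy(host):
            return host
    return None


# Hypothetical host names: two masters in the production colo, two in DR.
MASTERS = ["prod-master-1", "prod-master-2", "dr-master-1", "dr-master-2"]

# Simulate losing the whole production colo.
down = {"prod-master-1", "prod-master-2"}
active = pick_master(MASTERS, lambda host: host not in down)
print(active)  # dr-master-1
```

Because all masters read the same mirrored filer data, switching the active host does not require restoring from a backup, which is what makes a switch in seconds plausible.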

Issues and Solutions: Huge console logs crash Jenkins
Issues
- When a console log gets too big, the JVM crashes with an out-of-memory (OOM) error.
Solution
- Used the open-source ‘Log File Checker’ plugin to fail the job if its console log reaches 200MB.
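The size check behind this rule is simple enough to sketch. The following is not the plugin's implementation, only an illustration of the threshold test. The function name is hypothetical, and treating 200MB as 200 * 1024 * 1024 bytes is an assumption.

```python
import os
import tempfile

# 200MB limit from the slide; binary megabytes are an assumption.
MAX_LOG_BYTES = 200 * 1024 * 1024


def log_too_big(log_path: str, limit: int = MAX_LOG_BYTES) -> bool:
    """Return True once the console log has reached the size limit."""
    return os.path.getsize(log_path) >= limit


# Demo with a small temporary file and a tiny limit instead of a real 200MB log.
with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh.write(b"x" * 1024)
    demo_log = fh.name

print(log_too_big(demo_log, limit=512))  # True: 1024 bytes >= 512
print(log_too_big(demo_log))             # False: far below 200MB
os.remove(demo_log)
```

A job wrapper would call such a check periodically while the build runs and abort the build as failed when it returns True.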

Issues and Solutions: JMX Plugin
Issues
- The Jenkins API is not rich enough to monitor the build queue and executors.
Solution
- A Jenkins plugin that exposes attributes of the application's internal data model via JMX.
- The following MBeans are exposed by this plugin:
  - BusyExecutors: total number of executor threads that were running a build
  - TotalExecutors: total number of executor threads across all nodes
  - BuildableItemCount
  - BlockedItemCount
  - WaitingItemCount
  - ItemCount
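Once these values are collected, simple health indicators can be derived from them. The sketch below assumes the values have already been read from JMX by some collector. The `QueueSnapshot` class, its methods, and the sample numbers are hypothetical and not part of the plugin. It also assumes that buildable items are queued builds ready to run but waiting for a free executor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSnapshot:
    """One reading of the executor and queue values listed above."""
    busy_executors: int
    total_executors: int
    buildable_item_count: int
    blocked_item_count: int
    waiting_item_count: int

    def utilization(self) -> float:
        """Fraction of executors currently running a build."""
        if self.total_executors == 0:
            return 0.0
        return self.busy_executors / self.total_executors

    def saturated(self) -> bool:
        """True when every executor is busy and runnable builds are still queued."""
        return (self.busy_executors >= self.total_executors
                and self.buildable_item_count > 0)


# Hypothetical sample reading.
snap = QueueSnapshot(busy_executors=300, total_executors=400,
                     buildable_item_count=0, blocked_item_count=5,
                     waiting_item_count=2)
print(snap.utilization())  # 0.75
print(snap.saturated())    # False
```

A sustained `saturated()` reading would be a natural alert condition, since it suggests the slave pool is too small for the current load.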

JMX Plugin

Issues and Solutions: Cleanup
Issues
- Jenkins provides the ‘Discard old builds’ feature, which controls the disk consumption of Jenkins by limiting the number of builds kept.
- But there is no feature to control other disk consumption, such as workspaces, chroots, and jobs.
Solution
- Added a script to implement a data retention policy.

Data Retention / Backup
- More than 35,000 jobs and 6 million builds since the beginning.
- All of this data can't be kept, since Jenkins loads jobs and their history into memory.
- To address this, we implemented the following data retention policy:
  - Job retention policy: jobs with no builds for 120 days are archived and removed.
  - Build retention policy: keep only the last 150 builds.
  - Workspace clean: remove the workspace from all slaves except the one where the last build ran.
  - Chroot clean-up policy: remove chroots that are 18 hours old or older.
- The master configuration and all job configurations are backed up every 15 minutes.
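The retention rules above translate directly into a few predicates. The following is a minimal sketch of those rules, not the team's actual script. The function names are hypothetical, and the thresholds are taken from the slide.

```python
from datetime import datetime, timedelta
from typing import List

JOB_IDLE_LIMIT = timedelta(days=120)   # archive jobs with no builds for 120 days
BUILDS_TO_KEEP = 150                   # keep only the last 150 builds
CHROOT_MAX_AGE = timedelta(hours=18)   # remove chroots 18 hours old or older


def should_archive_job(last_build: datetime, now: datetime) -> bool:
    """True if the job has had no builds for the idle limit or longer."""
    return now - last_build >= JOB_IDLE_LIMIT


def builds_to_discard(build_numbers: List[int]) -> List[int]:
    """Return every build number except the newest BUILDS_TO_KEEP."""
    return sorted(build_numbers)[:-BUILDS_TO_KEEP]


def chroot_expired(created: datetime, now: datetime) -> bool:
    """True if the chroot is at least CHROOT_MAX_AGE old."""
    return now - created >= CHROOT_MAX_AGE


now = datetime(2015, 6, 1, 12, 0)
print(should_archive_job(datetime(2015, 1, 1), now))   # True: idle for 151 days
print(builds_to_discard(list(range(1, 153))))          # [1, 2]
print(chroot_expired(datetime(2015, 6, 1, 0, 0), now)) # False: only 12 hours old
```

Keeping each rule as a pure function of timestamps and build numbers makes the policy easy to test before it is allowed to delete anything.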

Jenkins Dashboard Build Summary

Jenkins Dashboard Job Summary

CI Metrics & Trends

Build Highlights Plugin

What Broke The Build Plugin

Job Meta data Plugin

CD Pipeline

Splunk Dashboard

Problems
- Multi-master support
- Load time and performance
- Concept of pipeline
- Resource consumption
- Cross-Jenkins-instance trigger