Andrew Lahiff HEP SYSMAN June 2016 Hiding infrastructure problems from users: load balancers at the RAL Tier-1 1.

Presentation transcript:

1. Hiding infrastructure problems from users: load balancers at the RAL Tier-1
Andrew Lahiff, HEP SYSMAN, June 2016

2. What we used to do (& still mostly do)
DNS aliases

  -bash-4.1$ nslookup srm-atlas.gridpp.rl.ac.uk
  Server:
  Address: #53
  Name: srm-atlas.gridpp.rl.ac.uk
  Address:
  Name: srm-atlas.gridpp.rl.ac.uk
  Address:
  Name: srm-atlas.gridpp.rl.ac.uk
  Address:
  Name: srm-atlas.gridpp.rl.ac.uk
  Address:

What's wrong with this?

3. Problems with DNS aliases
What if one ATLAS SRM dies overnight?
  - The dead machine is still visible to users via the DNS alias
    - 1/N requests will fail
    - We would need to contact RAL networking to change it
  - We get a pager alarm
    - Someone will look into it, even at 2am on a Sunday morning
Upgrades & reboots
  - Machines being updated are still in the DNS alias, so maintenance is visible to users
    - 1/N requests will fail
    - What happens if a "quick reboot" takes longer than expected? (fsck, problem with VM, ...)
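To put a number on "1/N requests will fail", here is a minimal sketch. It assumes clients pick a host from the alias uniformly at random and never retry. The function names are illustrative and not from the talk, and the figure of four hosts is taken from the four entries in the nslookup output.

```python
def failure_fraction(total_hosts: int, dead_hosts: int = 1) -> float:
    """Expected fraction of requests that land on a dead host."""
    if total_hosts <= 0 or not 0 <= dead_hosts <= total_hosts:
        raise ValueError("need total_hosts > 0 and 0 <= dead_hosts <= total_hosts")
    return dead_hosts / total_hosts


def chance_all_succeed(total_hosts: int, requests: int, dead_hosts: int = 1) -> float:
    """Probability that `requests` independent requests all avoid the dead hosts."""
    return (1 - failure_fraction(total_hosts, dead_hosts)) ** requests


if __name__ == "__main__":
    # Assumption: four hosts behind the alias, one of them dead.
    print(f"{failure_fraction(4):.0%} of requests fail")
    print(f"{chance_all_succeed(4, 10):.1%} chance that 10 requests in a row all succeed")
```

Under these assumptions, a client that makes many requests is very likely to notice a single dead host, which is why the dead machine cannot simply be left in the alias until morning.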

4. What if we had a more dynamic DNS?
CERN seem to do this (I think)
This is perhaps better, but still has problems
  - What about DNS libraries not respecting DNS TTLs and caching the results of name lookups?
  - Many applications do DNS lookups once and cache the results
  - Issues with IPv6 DNS round-robin
    - round-robin doesn't work: many clients will always pick the "first" host
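The caching problem can be illustrated with a small simulation. This is a sketch under stated assumptions: `DynamicDns`, `CachingClient` and `FreshClient` are hypothetical stand-ins rather than real resolver APIs, and the host names are made up.

```python
class DynamicDns:
    """Toy DNS service that only publishes hosts currently marked healthy."""

    def __init__(self, hosts):
        self.healthy = list(hosts)

    def mark_dead(self, host):
        self.healthy.remove(host)

    def resolve(self):
        return list(self.healthy)


class CachingClient:
    """Resolves once, then reuses the answer forever (ignores any TTL)."""

    def __init__(self, dns):
        self.dns = dns
        self.cached = None

    def pick_host(self):
        if self.cached is None:
            self.cached = self.dns.resolve()
        return self.cached[0]  # always takes the "first" address


class FreshClient:
    """Resolves again on every request."""

    def __init__(self, dns):
        self.dns = dns

    def pick_host(self):
        return self.dns.resolve()[0]


if __name__ == "__main__":
    dns = DynamicDns(["srm1", "srm2", "srm3"])  # hypothetical host names
    stale, fresh = CachingClient(dns), FreshClient(dns)
    stale.pick_host()          # primes the cache while srm1 is healthy
    dns.mark_dead("srm1")      # the dynamic DNS reacts...
    print("caching client ->", stale.pick_host())  # ...but this client does not
    print("fresh client   ->", fresh.pick_host())
```

However quickly the DNS is updated, the caching client keeps using the address it looked up first, so a dynamic DNS alone does not hide the failure from it.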

5. The alternative
Instead of users connecting directly to servers...
[Diagram: user connecting directly to server 1, server 2, server 3 and server 4; label: site firewall hole]

6. The alternative
Put a load balancer in between
[Diagram: user connecting to a load balancer in front of server 1, server 2, server 3 and server 4; label: site firewall hole]

7. Building blocks
HAProxy
  - Open source load balancer for TCP & HTTP
Keepalived
  - Linux Virtual Server (LVS) router
  - Can provide HA floating IP addresses using the Virtual Router Redundancy Protocol (VRRP)

8. Architecture at RAL
[Diagram: user, DNS, floating IP 1 and floating IP 2, load balancer 1 and load balancer 2 (each running Keepalived and HAProxy), server 1, server 2, server 3 and server 4]
Each service uses 2 floating IPs with a DNS alias
Each load balancer runs on a VM

9. If a backend server dies
[Diagram: user, DNS, floating IP 1 and floating IP 2, load balancer 1 and load balancer 2 (each running Keepalived and HAProxy), server 1 to server 4]
HAProxy stops sending requests to the broken server

10. If a HAProxy dies
[Diagram: user, DNS, floating IP 1 and floating IP 2, load balancer 1 and load balancer 2 (each running Keepalived and HAProxy), server 1 to server 4]
Keepalived checks if HAProxy is running
The floating IP(s) move to the other load balancer

11. If a load balancer host dies
[Diagram: user, DNS, floating IP 1 and floating IP 2, load balancer 1 and load balancer 2 (each running Keepalived and HAProxy), server 1 to server 4]
The floating IP(s) move to the other load balancer
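The two failover cases (HAProxy dies, or the whole load balancer host dies) can be modelled in a few lines. This is a toy model, not Keepalived itself: it assumes the usual VRRP behaviour in which the eligible node with the highest priority holds the address, and the priority values and names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoadBalancer:
    name: str
    host_up: bool = True
    haproxy_running: bool = True

    @property
    def eligible(self) -> bool:
        # A node can hold a floating IP only if the host is up
        # and the check that HAProxy is running passes.
        return self.host_up and self.haproxy_running


def holder(priorities: dict, balancers: dict):
    """Name of the load balancer holding one floating IP, or None.

    `priorities` maps load balancer name -> priority for this IP;
    the eligible node with the highest priority wins.
    """
    candidates = [name for name in priorities if balancers[name].eligible]
    if not candidates:
        return None
    return max(candidates, key=lambda name: priorities[name])


if __name__ == "__main__":
    lbs = {"lb1": LoadBalancer("lb1"), "lb2": LoadBalancer("lb2")}
    # Hypothetical priorities: each load balancer prefers one of the two IPs.
    floating_ips = {
        "floating IP 1": {"lb1": 150, "lb2": 100},
        "floating IP 2": {"lb1": 100, "lb2": 150},
    }
    lbs["lb1"].haproxy_running = False  # HAProxy on lb1 dies
    for ip, prio in floating_ips.items():
        print(ip, "->", holder(prio, lbs))
```

With both nodes healthy, each load balancer holds one floating IP, so both do useful work. When either HAProxy or the host fails on one node, both addresses end up on the survivor.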

12. Add a new backend server
Only need to update HAProxy config
[Diagram: user, DNS, floating IP 1 and floating IP 2, load balancer 1 and load balancer 2 (each running Keepalived and HAProxy), server 1 to server 4, plus the new server 5]

13. How to check if backend servers are healthy?
HAProxy has configurable built-in checks
  - TCP connection attempt
  - TCP, expect response to contain a specified string
  - SSLv3 client hello
  - HTTP response code
  - HTTP, expect response to contain a specified string
  - MySQL, PostgreSQL
  - SMTP
For FTS3
  - using TCP (SOAP API), SSLv3 (RESTful API, monitoring app)
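HAProxy's checks are switched on in its configuration rather than written by hand. The sketch below only illustrates, in Python, what the first two kinds of check in the list do. The function names are hypothetical, and the other checks (HTTP, SSL hello, database, SMTP) are not covered.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Healthy if a TCP connection can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def tcp_expect_check(host: str, port: int, expected: bytes,
                     timeout: float = 2.0) -> bool:
    """Healthy if the first data the server sends contains `expected`."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            data = conn.recv(1024)  # may be a partial response
            return expected in data
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical target; replace with a real backend host and port.
    print("connect check:", tcp_check("127.0.0.1", 8443, timeout=0.5))
```

A connect-only check proves that something is listening, while the expect-string variant also proves that the service answers sensibly, which catches a process that accepts connections but is otherwise broken.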

14. Load balancing options
Round-robin
  - each backend server used in turn
  - used for FTS3 (SOAP), FTS3 (REST)
Source
  - each client IP always goes to the same backend server
  - used for FTS3 (monitoring app)
    - no more endless "identify yourself with a certificate" requests from your browser!
    - compare to
Also some others
  - leastconn, first, ...
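The difference between these policies is easiest to see side by side. The classes below are a minimal sketch, not HAProxy's implementation: the class names are made up, and `zlib.crc32` is only a stand-in hash, not claimed to be the one HAProxy uses.

```python
import itertools
import zlib


class RoundRobin:
    """Each backend server is used in turn."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self, client_ip=None):
        return next(self._cycle)


class Source:
    """A given client IP always maps to the same server
    (as long as the server list does not change)."""

    def __init__(self, servers):
        self.servers = list(servers)

    def pick(self, client_ip):
        index = zlib.crc32(client_ip.encode()) % len(self.servers)
        return self.servers[index]


class LeastConn:
    """Pick the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self, client_ip=None):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1


if __name__ == "__main__":
    servers = ["server 1", "server 2", "server 3", "server 4"]
    rr, src = RoundRobin(servers), Source(servers)
    print([rr.pick() for _ in range(5)])
    print({src.pick("192.0.2.10") for _ in range(5)})  # a single server
```

Stickiness is what makes the source policy suit a browser-facing app: every request from one browser reaches the same backend, so the certificate prompt is not repeated for each server.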

15. Monitoring
HAProxy stats page

16. Monitoring
Use Telegraf to send HAProxy metrics to InfluxDB

17. Monitoring
Nagios tests per load balancer
  - check that the number of healthy backend servers for each service is above a minimum threshold
Tests for floating IPs
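A threshold test of this kind could look roughly like the following. It is a sketch, not the check used at RAL. It assumes the input is HAProxy's CSV statistics export, whose header line starts with "# pxname,svname," and includes a status column. The sample is trimmed to three columns and its service and server names are made up.

```python
import csv
import io

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes

# Trimmed, hypothetical sample; the real export has many more columns.
SAMPLE = """# pxname,svname,status
fts3-rest,FRONTEND,OPEN
fts3-rest,fts01,UP
fts3-rest,fts02,DOWN
fts3-rest,BACKEND,UP
"""


def healthy_backends(stats_csv: str) -> dict:
    """Count servers reported as UP, per proxy name."""
    text = stats_csv.lstrip()
    if text.startswith("# "):
        text = text[2:]  # drop the comment marker in front of the header
    counts = {}
    for row in csv.DictReader(io.StringIO(text)):
        if row["svname"] in ("FRONTEND", "BACKEND"):
            continue  # aggregate rows, not individual servers
        counts.setdefault(row["pxname"], 0)
        if row["status"].startswith("UP"):
            counts[row["pxname"]] += 1
    return counts


def check(stats_csv: str, minimum: int):
    """Return (exit_code, message) in the style of a Nagios plugin."""
    counts = healthy_backends(stats_csv)
    failing = {name: n for name, n in counts.items() if n < minimum}
    if failing:
        return CRITICAL, f"CRITICAL - fewer than {minimum} healthy servers: {failing}"
    return OK, f"OK - healthy servers per service: {counts}"


if __name__ == "__main__":
    code, message = check(SAMPLE, minimum=2)
    print(message)
    raise SystemExit(code)
```

Alerting on a per-service minimum rather than on every dead host is what removes the 2am call-out: one failed backend is absorbed by the load balancer, and only a service running short of healthy servers raises an alarm.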

18. Current load
With FTS3 only, load is very, very low!
  - Average 13 sessions/sec, 3 concurrent sessions (per lb)

19. Testing at higher scales
Random example using Apache Bench, zero tuning (using backend servers running httpd)
  - 8000 sessions/sec, 800 concurrent sessions
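The session rate and concurrency figures can be related with Little's law (concurrency = arrival rate x mean time in system). The sketch below assumes the quoted numbers are steady-state averages, which the slides do not state. The derived durations and headroom are therefore rough estimates, not measurements from the talk.

```python
def mean_session_seconds(concurrent_sessions: float, sessions_per_sec: float) -> float:
    """Little's law, L = lambda * W, rearranged to W = L / lambda."""
    return concurrent_sessions / sessions_per_sec


if __name__ == "__main__":
    production = mean_session_seconds(3, 13)       # FTS3 load per load balancer
    benchmark = mean_session_seconds(800, 8000)    # Apache Bench test
    headroom = 8000 / 13                           # tested rate vs. current rate
    print(f"production: ~{production:.2f} s per session")
    print(f"benchmark:  ~{benchmark:.2f} s per session")
    print(f"headroom:   ~{headroom:.0f}x the current session rate")
```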

20. Current status
Services fully using the load balancers:
  - FTS3 ("test" instance, i.e. ATLAS): since 26th April
  - FTS3 ("prod" instance, i.e. CMS etc): since 31st May
No problems so far

21. Future plans
More advanced HAProxy health checks
  - e.g. host cert expiry, CPU load too high, ...
More services
  - Other existing services
  - Future services that require true high availability
    - SCD OpenStack (dev instance already using HAProxy, Keepalived)
    - Ceph gateways (e.g. gridFTP control traffic)
It's also the first step required before moving to a more dynamic infrastructure
  - e.g. container orchestration