Slide 1: Linux-HA Release 2 – An Overview
High-Availability Best Practices IV – October 2005
Alan Robertson, Project Leader – Linux-HA project
alanr@unix.sh (a.k.a. alanr@us.ibm.com)
IBM Linux Technology Center

Slide 2: Agenda
High-Availability (HA) clustering?
What is the Linux-HA project?
Linux-HA applications and customers
Linux-HA Release 1 / Release 2 feature comparison
Release 2 details
Request for feedback
DRBD – an important component
Thoughts about cluster security

Slide 3: What Is HA Clustering?
Putting together a group of computers which trust each other to provide a service even when system components fail
When one machine goes down, others take over its work
This involves IP address takeover, service takeover, etc.
New work comes to the remaining machines
Not primarily designed for high performance

Slide 4: High Availability Through Redundancy and Monitoring
Redundancy eliminates Single Points Of Failure (SPOFs)
Monitoring determines when things need to change
Reduces the cost of planned and unplanned outages by reducing MTTR (Mean Time To Repair)

Slide 5: Failover and Restart
Monitoring detects failures (hardware, network, applications)
Automatic recovery from failures (no human intervention)
Managed restart or failover to standby systems and components

Slide 6: What Can HA Clustering Do For You?
It cannot achieve 100% availability – nothing can
HA clustering is designed to recover from single faults
It can make your outages very short – from about a second to a few minutes
It is like a magician's (illusionist's) trick: when it goes well, the hand is faster than the eye; when it goes not so well, it can be quite visible
A good HA clustering system adds a "9" to your base availability: 99 -> 99.9, 99.9 -> 99.99, 99.99 -> 99.999, etc.

Slide 7: Lies, Damn Lies, and Statistics
Counting nines – downtime allowed per year:
  99%      about 3.7 days
  99.9%    about 8.8 hours
  99.99%   about 53 minutes
  99.999%  about 5.3 minutes

Slide 8: The Desire for HA Systems
Who wants low-availability systems?
Why are so few systems high-availability?

Slide 9: Why Isn't Everything HA?
Cost
Complexity

Slide 10: Complexity
Complexity is the Enemy of Reliability

Slide 12: Commodity HA?
Installations with more than 200 Linux-HA pairs:
  Autostrada – Italy
  Italian Bingo Authority
  Oxfordshire School System
  Many retailers (through IRES and others): Karstadt, Circuit City, etc.
Also a component in commercial routers, firewalls, and security hardware

Slide 13: The HA Continuum
Single-node HA system (monitoring without redundancy)
  Provides for application monitoring and restart
  Easy, near-zero-cost entry point – the HA system starts init scripts instead of /etc/init.d/rc (or equivalent)
  Addresses a Solaris / Linux functional gap
Multiple virtual machines on a single physical machine
  Adds OS crash protection and rolling upgrades of OS and application – good for security fixes, etc.
  Many possibilities for interactions with virtual machines exist
Multiple physical machines (a "normal" cluster)
  Adds protection against hardware failures
Split-site ("stretch") clusters
  Adds protection against site-wide failures (power, air conditioning, flood, fire)

Slide 14: How Does HA Work?
Manage redundancy to improve service availability
Like a cluster-wide super-init with monitoring – even complex services can now be "respawned":
  on node (computer) death
  on "impairment" of nodes
  on loss of connectivity
  for services that aren't working (not necessarily stopped)
  while managing potentially complex dependency relationships

Slide 15: Single Points of Failure (SPOFs)
A single point of failure is a component whose failure will cause near-immediate failure of an entire system or service
Good HA design adds redundancy to eliminate single points of failure
Non-obvious SPOFs can require deep expertise to spot

Slide 16: The "Three R's" of High-Availability
Redundancy, Redundancy, Redundancy
If this sounds redundant, that's probably appropriate...
Most SPOFs are eliminated by redundancy
HA clustering is a good way of providing and managing redundancy

Slide 17: Redundant Communications
Intra-cluster communication is critical to HA system operation
Most HA clustering systems provide mechanisms for redundant internal communication for heartbeats, etc.
External communication is usually essential to the provision of service
External communication redundancy is usually accomplished through routing tricks – having an expert in BGP or OSPF routing helps

Slide 18: Fencing
Guarantees resource integrity in certain difficult cases (e.g. split-brain)
Four common methods:
  Fiber Channel switch lockouts
  SCSI Reserve/Release (painful to make reliable)
  Self-fencing (like IBM ServeRAID)
  STONITH – Shoot The Other Node In The Head
Linux-HA has native support for the last two

Slide 19: Redundant Data Access
Replicated – copies of data are kept updated on more than one computer in the cluster
Shared – typically Fiber Channel disk (SAN); sometimes shared SCSI
Back-end storage ("Somebody Else's Problem") – NFS, SMB, back-end database
All are supported by Linux-HA

Slide 20: Data Sharing – Replication
Some applications provide their own replication: DNS, DHCP, LDAP, DB2, etc.
Linux has excellent disk replication methods available – DRBD is my favorite
DRBD-based HA clusters are shockingly cheap
Some environments can live with less "precise" replication methods – rsync, etc.
Generally does not support parallel access
Fencing usually required
EXTREMELY cost effective

Slide 21: Data Sharing – ServeRAID et al
The IBM ServeRAID SCSI controller is self-fencing
  This helps integrity in failover environments
  This makes cluster filesystems, etc. impossible – no Oracle RAC, no GPFS, etc.
ServeRAID failover requires a script to perform volume handover – Linux-HA provides such a script in open source
Linux-HA is ServerProven with ServeRAID

Slide 22: Data Sharing – Shared Disk
The most classic data sharing mechanism – commonly Fiber Channel
Allows for failover mode
Allows for true parallel access – Oracle RAC, cluster filesystems, etc.
Fencing always required with shared disk

Slide 23: Data Sharing – Back-End
Network Attached Storage can act as a data sharing method
Existing back-end databases can also act as a data sharing mechanism
Both make reliable and redundant data sharing Somebody Else's Problem (SEP) – if they did a good job, you can benefit from them
Beware SPOFs in your local network

Slide 24: The Linux-HA Project
Linux-HA is the oldest high-availability project for Linux, with the largest associated community
Linux-HA is the OSS portion of IBM's HA strategy for Linux
Linux-HA is the best-tested open source HA product
The Linux-HA package is called "Heartbeat" (though it does much more than heartbeat)
Linux-HA has been in production since 1999, and is currently in use on more than ten thousand sites
Linux-HA also runs on FreeBSD and Solaris, and is being ported to OpenBSD and others
Linux-HA is shipped with every major Linux distribution except one
Release 2 shipped at the end of July – more than 6000 downloads since then

Slide 25: Linux-HA Release 1 Applications
Database servers (DB2, Oracle, MySQL, others)
Load balancers
Web servers
Custom applications
Firewalls
Retail point-of-sale solutions
Authentication
File servers
Proxy servers
Medical imaging
Almost any type of server application you can think of – except SAP

Slide 26: Linux-HA Customers
FedEx – truck location tracking
BBC – Internet infrastructure
Oxfordshire Schools – universal servers: an HA pair in every school
The Weather Channel (weather.com)
Sony (manufacturing)
ISO New England – manages the power grid using 25 Linux-HA clusters
MAN Nutzfahrzeuge AG – truck manufacturing division of MAN AG
Karstadt, Circuit City – use Linux-HA and databases in several hundred stores each
Citysavings Bank in Munich (infrastructure)
Bavarian Radio Station (Munich) – coverage of the 2002 Olympics in Salt Lake City
Emageon – medical imaging services
Incredimail – bases their mail service on Linux-HA on IBM hardware
University of Toledo (US) – 20k-student Computer Aided Instruction system

Slide 27: Linux-HA Release 1 Capabilities
Supports 2-node clusters
Can use serial, UDP bcast, mcast, or ucast communication
Fails over on node failure
Fails over on loss of IP connectivity
Capability for failing over on loss of SAN connectivity
Limited command-line administrative tools to fail over, query current status, etc.
Active/Active or Active/Passive
Simple resource group dependency model
Requires external tool for resource (service) monitoring
SNMP monitoring

Slide 28: Linux-HA Release 2 Capabilities
Built-in resource monitoring
Support for the OCF resource standard
Much larger clusters supported (>= 8 nodes)
Sophisticated dependency model
Rich constraint support (resources, groups, incarnations, master/slave)
XML-based resource configuration
Coming in 2.0.x (later in 2005):
  Configuration and monitoring GUI
  Support for the GFS cluster filesystem
  Multi-state (master/slave) resource support
  Monitoring of arbitrary external entities (temperature, SAN, network)

Slide 29: Release 2 Credits
Andrew Beekhof (SUSE) – CRM, CIB
Gouchun Shi (NCSA) – significant infrastructure improvements
Sun Jiang Dong and Huang Zhen – LRM, stonithd, and testing
Lars Marowsky-Bree (SUSE) – architecture, leadership
Alan Robertson – architecture, project leadership, original heartbeat code, testing, evangelism

Slide 30: Linux-HA Release 1 Architecture (diagram)

Slide 31: Linux-HA Release 2 Architecture (diagram; adds the TE and PE)

Slide 32: Linux-HA Release 2 Architecture (diagram; more detail)

Slide 33: Resource Objects in Release 2
Release 2 supports "resource objects", which can be any of the following:
  Primitive resources
  Resource groups
  Resource clones – "n" copies of a resource object
  Multi-state (master/slave) resources

Slide 34: Classes of Resource Agents in R2 (Resource Primitives)
OCF – Open Cluster Framework (http://opencf.org/)
  Take parameters as name/value pairs through the environment
  Can be monitored well by R2
Heartbeat – R1-style heartbeat resources
  Take parameters as command-line arguments
  Can be monitored by the status action
LSB – standard LSB init scripts
  Take no parameters
  Can be monitored by the status action
Stonith – node reset capability
  Very similar to OCF resources

Slide 35: An OCF Primitive Object
Attribute nvpairs are translated into environment variables
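The slide's XML sample did not survive the transcript. The fragment below is a minimal sketch of an OCF primitive in the Release 2 CIB, assuming the 2.0.x XML schema; the ids, the IP address, and the monitor interval/timeout are illustrative values, not from the original slide.

    <primitive id="web_ip" class="ocf" provider="heartbeat" type="IPaddr">
      <operations>
        <!-- built-in monitoring: the CRM runs the agent's monitor action every 10s -->
        <op id="web_ip_mon" name="monitor" interval="10s" timeout="20s"/>
      </operations>
      <instance_attributes id="web_ip_attrs">
        <attributes>
          <!-- each nvpair is exported to the agent as an OCF_RESKEY_<name> environment variable -->
          <nvpair id="web_ip_addr" name="ip" value="10.0.0.1"/>
        </attributes>
      </instance_attributes>
    </primitive>

Here the IPaddr agent would see OCF_RESKEY_ip=10.0.0.1 in its environment when the CRM invokes it.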

Slide 36: An LSB Primitive Resource Object (i.e., an init script)
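Again the slide's sample is missing; this is a sketch assuming a standard LSB init script named smb exists in /etc/init.d on every node, with illustrative ids and timings.

    <!-- class="lsb": the CRM calls /etc/init.d/smb start | stop | status -->
    <primitive id="samba" class="lsb" type="smb">
      <operations>
        <!-- LSB resources take no parameters; monitoring maps to the script's status action -->
        <op id="samba_mon" name="monitor" interval="30s" timeout="30s"/>
      </operations>
    </primitive>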

Slide 37: A STONITH Primitive Resource
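A sketch of a STONITH primitive, assuming the ssh reset plugin (suitable only for test clusters) and its hostlist parameter; the ids and node names are illustrative.

    <!-- class="stonith": configured like an OCF resource, but drives a node-reset plugin -->
    <primitive id="fence_ssh" class="stonith" type="ssh">
      <instance_attributes id="fence_ssh_attrs">
        <attributes>
          <!-- hostlist: the nodes this device is allowed to reset -->
          <nvpair id="fence_ssh_hosts" name="hostlist" value="node01 node02"/>
        </attributes>
      </instance_attributes>
    </primitive>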

Slide 38: Resource Groups
Resource groups provide a shorthand for creating ordering and co-location dependencies
Each resource object in the group is declared to have linear start-after ordering relationships
Each resource object in the group is declared to have co-location dependencies on the others
This is an easy way of converting Release 1 resource groups to Release 2
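A sketch of a Release 2 group, assuming the 2.0.x CIB schema; the member resources (a service IP plus an apache init script) and all ids are invented for illustration. The group itself supplies the start-after ordering and co-location described above.

    <group id="webserver">
      <!-- members start in the listed order and are kept on the same node -->
      <primitive id="ws_ip" class="ocf" provider="heartbeat" type="IPaddr">
        <instance_attributes id="ws_ip_attrs">
          <attributes>
            <nvpair id="ws_ip_addr" name="ip" value="10.0.0.10"/>
          </attributes>
        </instance_attributes>
      </primitive>
      <primitive id="ws_httpd" class="lsb" type="apache"/>
    </group>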

Slide 39: Resource Clones
Resource clones allow one to have a resource object which runs multiple ("n") times on the cluster
This is useful for managing:
  load-balancing clusters where you want "n" of them to be slave servers
  cluster filesystem mount points
  cluster alias IP addresses
A cloned resource object can be a primitive or a group

Slide 40: Sample Clone XML
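The actual sample XML is not in the transcript. The fragment below is a reconstruction under the assumption that clone_max and clone_node_max were set as instance attributes of the clone in the 2.0.x schema; the cloned child (a cluster-filesystem mount) and all ids and values are illustrative.

    <clone id="fs_mount_clone">
      <instance_attributes id="fs_mount_clone_attrs">
        <attributes>
          <!-- run 2 copies cluster-wide, at most 1 per node -->
          <nvpair id="fs_clone_max" name="clone_max" value="2"/>
          <nvpair id="fs_clone_node_max" name="clone_node_max" value="1"/>
        </attributes>
      </instance_attributes>
      <!-- the cloned child: here a cluster-filesystem mount point -->
      <primitive id="fs_mount" class="ocf" provider="heartbeat" type="Filesystem">
        <instance_attributes id="fs_mount_attrs">
          <attributes>
            <nvpair id="fs_dev" name="device" value="/dev/sdb1"/>
            <nvpair id="fs_dir" name="directory" value="/shared"/>
            <nvpair id="fs_type" name="fstype" value="gfs"/>
          </attributes>
        </instance_attributes>
      </primitive>
    </clone>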

Slide 41: Multi-State (master/slave) Resources (coming in 2.0.3)
Normal resources can be in one of two stable states: running and stopped
Multi-state resources can have more than two stable states, for example: running-as-master, running-as-slave, stopped
This is ideal for modeling replication resources like DRBD
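The transcript has no example for this. The sketch below assumes the master_slave container element and an OCF drbd agent with a drbd_resource parameter, roughly as they appeared around 2.0.3; treat the element, attribute, and parameter names as assumptions, and the ids and values as illustrative.

    <master_slave id="ms_drbd0">
      <instance_attributes id="ms_drbd0_attrs">
        <attributes>
          <nvpair id="ms_drbd0_clones"  name="clone_max"      value="2"/>
          <nvpair id="ms_drbd0_nodemax" name="clone_node_max" value="1"/>
          <nvpair id="ms_drbd0_masters" name="master_max"     value="1"/>
        </attributes>
      </instance_attributes>
      <!-- the child resource is promoted to master on one node and runs as slave on the other -->
      <primitive id="drbd0" class="ocf" provider="heartbeat" type="drbd">
        <instance_attributes id="drbd0_attrs">
          <attributes>
            <nvpair id="drbd0_res" name="drbd_resource" value="r0"/>
          </attributes>
        </instance_attributes>
      </primitive>
    </master_slave>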

Slide 42: Basic Dependencies in Release 2
Ordering dependencies:
  start before (normally implies stop after)
  start after (normally implies stop before)
Mandatory co-location dependencies:
  must be co-located with
  cannot be co-located with
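A sketch of the two basic dependency types as CIB constraints, assuming the rsc_order and rsc_colocation elements of the 2.0.x schema and reusing the illustrative names from the group sketch above.

    <constraints>
      <!-- start ws_httpd after ws_ip (and, by implication, stop it before) -->
      <rsc_order id="httpd_after_ip" from="ws_httpd" type="after" to="ws_ip"/>
      <!-- keep ws_httpd on the same node as ws_ip; a negative score would forbid co-location -->
      <rsc_colocation id="httpd_with_ip" from="ws_httpd" to="ws_ip" score="INFINITY"/>
    </constraints>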

Slide 43: Resource Location Constraints
Mandatory constraints: resource objects can be constrained to run on any selected subset of nodes; the default depends on the setting of symmetric_cluster
Preferential constraints: resource objects can also be preferentially constrained to run on specified nodes by providing weightings for arbitrary logical conditions
The resource object is run on the node which has the highest weight (score)
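As a sketch of a mandatory location constraint, assuming the 2.0.x rsc_location/rule/expression syntax and the #uname node-name attribute, this keeps the illustrative webserver group off node03:

    <rsc_location id="webserver_not_node03" rsc="webserver">
      <!-- -INFINITY makes this a hard rule: the resource may never run on node03 -->
      <rule id="not_node03" score="-INFINITY">
        <expression id="not_node03_expr" attribute="#uname" operation="eq" value="node03"/>
      </rule>
    </rsc_location>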

Slide 44: Advanced Constraints
Nodes can have arbitrary attributes associated with them in name=value form
Attributes have types: int, string, version
Constraint expressions can use these attributes as well as node names, etc. in largely arbitrary ways
Operators: =, !=, <, >, <=, >=, defined(attrname), undefined(attrname), colocated(resource id), not colocated(resource id)

Slide 45: Advanced Constraints (cont'd)
Each constraint is associated with a particular resource and is evaluated in the context of a particular node
A constraint has a boolean predicate, built from the expressions above, and an associated weight (condition)
Weights can be constants – or attribute values
If the predicate is true, then the condition is used to compute the weight associated with locating the given resource on the given node
Conditions are given weights, positive or negative
Additionally, there are special values for modeling must-have conditions: +INFINITY and -INFINITY
The total score is the sum of all the applicable constraint weights
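A sketch of this scoring mechanism, assuming the rule/expression syntax of the 2.0.x CIB; the attribute names (ram, san_ok), the resource name, and the scores are invented for illustration.

    <rsc_location id="database_placement" rsc="database">
      <!-- predicate true => add 1000 to this node's score for the database resource -->
      <rule id="prefer_big_ram" score="1000">
        <expression id="big_ram" attribute="ram" operation="gte" value="4096" type="number"/>
      </rule>
      <!-- must-have condition: a node without the san_ok attribute can never run the resource -->
      <rule id="require_san" score="-INFINITY">
        <expression id="san_missing" attribute="san_ok" operation="not_defined"/>
      </rule>
    </rsc_location>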

Slide 46: Sample Dynamic Attribute Use
Attributes are arbitrary – they are only given meaning by rules
You can assign them values from external programs
For example:
  Create a rule which uses the attribute fc_status as its weight for some resource needing a Fiber Channel connection
  Write a script to set fc_status for a node to 0 if the FC connection is working, and -10000 if it is not
  Now those resources automatically move to a place where the FC connection is working – if there is such a place; if not, they stay where they are
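A sketch of the fc_status example, under the assumption that a rule could take a score_attribute (so the attribute's current value becomes the weight) in the 2.0.x schema; the resource name is illustrative, and the external script that maintains fc_status (for instance via a tool such as crm_attribute) is not shown.

    <rsc_location id="fc_app_placement" rsc="fc_app">
      <!-- the node's current fc_status value (0 or -10000, set by an external script)
           is used directly as the score for running fc_app on that node -->
      <rule id="fc_status_score" score_attribute="fc_status">
        <expression id="fc_status_defined" attribute="fc_status" operation="defined"/>
      </rule>
    </rsc_location>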

Slide 47: rsc_location Information
We prefer the webserver group to run on host node01
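Reconstructed from the slide caption, assuming the 2.0.x rsc_location syntax; the ids and the score of 100 are illustrative – any positive, finite score expresses a preference rather than a requirement.

    <rsc_location id="run_webserver" rsc="webserver">
      <rule id="prefer_node01" score="100">
        <expression id="prefer_node01_expr" attribute="#uname" operation="eq" value="node01"/>
      </rule>
    </rsc_location>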

Slide 48: Request for Feedback
Linux-HA Release 2 is a good, solid HA product
At this point, human and experience factors will likely be more helpful than most technical doo-dads and refinements
This audience knows more about that than probably any other similar audience in the world
So, check out Linux-HA Release 2 and tell us:
  what we got right
  what needs improvement
  what we got wrong
We are very responsive to comments
We look forward to your critiques, brickbats, and other comments

Slide 49: DRBD – RAID1 over the LAN
DRBD is a block-level replication technology
Every time a block is written on the master side, it is copied over the LAN and written on the slave side
Typically, a dedicated replication link is used
It is extremely cost-effective – common with xSeries
Worst case is around 10% throughput loss
Recent versions have very fast "full" resync

Slide 51: Security Considerations
Cluster: a computer whose backplane is the Internet
If this isn't scary, you don't understand...
You may think you have a secure cluster network:
  You're probably mistaken now
  You will be in the future

Slide 52: Secure Networks Are Difficult Because...
Security is not often well understood by admins
Security is well understood by "black hats"
Network security is easy to breach accidentally
  Users bypass it
  Hardware installers don't fully understand it
Most security breaches come from "trusted" staff
Staff turnover is often a big issue
Virus/worm/P2P technologies will create new holes, especially for Windows machines

Slide 53: Security Advice
Good HA software should be designed to assume insecure networks
Not all HA software assumes insecure networks
Good HA installation architects use dedicated (secure?) networks for intra-cluster HA communication
Crossover cables are reasonably secure – all else is suspect ;-)

Slide 54: References
http://linux-ha.org/
http://linux-ha.org/Talks (these slides)
http://linux-ha.org/download/
http://linux-ha.org/SuccessStories
http://linux-ha.org/Certifications
http://linux-ha.org/BasicArchitecture
http://linux-ha.org/NewHeartbeatDesign
www.linux-mag.com/2003-11/availability_01.html

Slide 55: Legal Statements
IBM is a trademark of International Business Machines Corporation.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or service marks of others.
This work represents the views of the author and does not necessarily reflect the views of the IBM Corporation.

