VMware vCenter Server Fault Tolerance


1 VMware vCenter Server Fault Tolerance
John Browne/Adly Taibi/Cormac Hogan Product Support Engineering Rev E. VMware Confidential

2 Module 2 Lessons
Lesson 1 – vCenter Server High Availability
Lesson 2 – vCenter Server Distributed Resource Scheduler
Lesson 3 – Fault Tolerance
Lesson 4 – Enhanced vMotion Compatibility
Lesson 5 – DPM - IPMI
Lesson 6 – vApps
Lesson 7 – Host Profiles
Lesson 8 – Reliability, Availability, Serviceability (RAS)
Lesson 9 – Web Access
Lesson 10 – vCenter Update Manager
Lesson 11 – Guided Consolidation
Lesson 12 – Health Status
Agenda Overview VI4 - Mod Slide

3 Module 2-3 Lessons
Lesson 1 – Understanding Fault Tolerance
Lesson 2 – Prerequisites for Fault Tolerance
Lesson 3 – Setting up Fault Tolerance
Lesson 4 – Viewing information about Fault Tolerant VMs
Lesson 5 – Fault Tolerant Guidelines
Lesson 6 – Troubleshooting Fault Tolerance
Agenda Overview

4 Understanding VMware Fault Tolerance
The VMware Fault Tolerance (FT) feature creates a virtual machine configuration that can provide continuous availability. VMware FT is built on the ESX/ESXi 4.0 host platform and is provided using the Record/Replay functionality implemented in the VM monitor.
VMware FT works by creating an identical copy of a virtual machine. One copy of the virtual machine, called the primary, is in the active state, receiving requests, serving information, and running applications. Another copy, called the secondary, receives the same input that is received by the primary.
Additional internal information:
https://wiki.eng.vmware.com/FaultToleranceMode
https://wiki.eng.vmware.com/PlatformSolutions/QA/FT/FTSetup
https://wiki.eng.vmware.com/RecordReplayEtAl

5 Understanding VMware Fault Tolerance (ctd)

6 Understanding VMware Fault Tolerance (ctd)
VMware FT provides a higher level of business continuity than HA. In the case of FT, the secondary immediately comes online and all (or almost all) information about the state of the virtual machine is preserved. The state of the secondary machine is dependent on the latency and lag between the primary and secondary VMs. VMware FT does not require a virtual machine restart, and applications and data stored in memory do not need to be re-entered or reloaded.

7 Virtual Machine Record & Replay
[Diagram: two stacks, each Application / Operating System / Virtualization Layer, one labeled RECORD and the other REPLAY]
RECORD: logging causes of non-determinism: input (network, user), asynchronous I/O (disk, devices), CPU timer interrupts
REPLAY: deterministic delivery of events previously logged
Result = repeatable VM execution
Describe a new functionality coming to the VMM: the ability to record execution and replay it later. Need to emphasize exact replay. Fault Tolerance is layered on a technology that VMware has developed over the last several years called Record/Replay. FT VMs point to the same storage (VMDK) as the original VM; no interrupts (e.g. timers, network traffic) are sent to this FT (secondary) VM while the original VM is running. This slide has animation which builds.
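The record/replay idea can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyVM`, `record`, and `replay` are hypothetical names and do not correspond to any VMware interface. The "VM" here is a deterministic state machine whose only source of non-determinism is external events, so logging each event together with the instruction count at which it arrived is enough for a second instance to reach the same state.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyVM:
    """A deterministic machine: state changes only via step() and deliver()."""
    state: int = 0
    icount: int = 0  # number of "instructions" executed so far

    def step(self) -> None:
        # Deterministic computation: same state in, same state out.
        self.state = (self.state * 31 + 7) % 1_000_003
        self.icount += 1

    def deliver(self, payload: int) -> None:
        # An external event (network packet, timer interrupt, ...) perturbs state.
        self.state ^= payload


def record(vm: ToyVM, steps: int, rng: random.Random) -> list:
    """Run the primary, logging (icount, payload) for every non-deterministic event."""
    log = []
    for _ in range(steps):
        if rng.random() < 0.2:  # an event arrives at an unpredictable time
            payload = rng.randrange(1 << 16)
            log.append((vm.icount, payload))
            vm.deliver(payload)
        vm.step()
    return log


def replay(vm: ToyVM, steps: int, log: list) -> None:
    """Run the secondary, injecting each logged event at the recorded icount."""
    pending = iter(log)
    nxt = next(pending, None)
    for _ in range(steps):
        if nxt is not None and nxt[0] == vm.icount:
            vm.deliver(nxt[1])
            nxt = next(pending, None)
        vm.step()


primary, secondary = ToyVM(), ToyVM()
event_log = record(primary, 1000, random.Random())
replay(secondary, 1000, event_log)
print(primary.state == secondary.state)
```

The secondary never sees the random source; it only sees the log, which is the point of the technique.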

8 Virtual Machine Record & Replay (ctd)
For a given primary VM, FT runs a secondary VM on a different host, sharing virtual disks with the primary.
The secondary VM is kept in “virtual lockstep” via logging information sent over a private network connection.
Only the primary VM sends and receives network packets; the secondary is “passive”.
If the primary host fails, the secondary VM takes over with no interruption to applications.
So the idea is to use record/replay technology to keep two VMs in virtual lockstep, so that we can deal with hardware failure…. We call the main VM the “primary VM” and the other….

9 FT in the VMkernel
The FT vmkernel module is called vmklogger.
Log entries are put in the log buffer, which is flushed/filled asynchronously.
Log entries are sent/received through a socket on the VMkernel NIC.
There should be a dedicated VMkernel network for logging which has FT Logging enabled.
[Diagram: vmkernel on the primary and backup hosts]

10 Determining Node Failure
FT does frequent heartbeating through multiple NICs to determine when primary/backup hosts are down.
The backup “goes live” and becomes the new primary if it declares the current primary dead.
We must have a method to distinguish a crashed host from a network failure (“split-brain”).
Our method is to use an atomic operation (rename) on shared VMFS.
Whenever the primary or backup believes the other host is down, it renames a common file.
The winner of the rename “race” survives; the loser of the rename “race” commits suicide.

11 Record/Replay and FT Requirements: ESX/HW
CPUs: limited set of processors (AMD Barcelona+, Intel Penryn+); processors must be from the same family (i.e. no mix/match).
Hardware Virtualization must be enabled in the BIOS.
Hosts must be in an HA-enabled cluster.
Storage: shared storage (FC, iSCSI, or NAS).
Network: minimum of 3 NICs for the various types of traffic (ESX Management/VMotion, VM traffic, FT Logging).
GigE is required for VMotion and FT Logging.
Minimize single points of failure in the environment, i.e. NIC teaming, multiple network switches, storage multipathing.
Primary and secondary hosts must be running the same build of ESX.

12 VMware Fault Tolerance and HA Work Together
FT VMs run only in an HA cluster.
Mission-critical VMs are protected by FT and HA; the remaining VMs are protected by HA.
When a host fails:
FT secondary takes over
New FT secondary is started by HA
HA-only VMs are restarted
[Diagram: hosts in a resource pool running VMs protected by VMware FT and VMware HA, with failed hosts marked X]
Please note that VMs protected by FT are not handled by VMware HA for restart priority; an FT VM is treated as “disabled” in the restart priority.
This slide is an animation that builds.


14 Prerequisites for VMware Fault Tolerance
For VMware FT to perform as expected, it must run in an environment that meets specific requirements.
The primary and secondary fault tolerant virtual machines must be in a VMware HA cluster.
Primary and secondary ESX/ESXi hosts should be from the same CPU model family.
Primary and secondary virtual machines must not run on the same host. FT will automatically place the secondary VM on a different host.

15 Prerequisites for VMware Fault Tolerance (ctd)
Storage
Virtual machine files must be stored on shared storage. Shared storage solutions include NFS, FC, and iSCSI.
For virtual disks on VMFS-3, the virtual disks must be thick, meaning they cannot be "thin" or sparsely allocated.
Turning on VMware FT will automatically convert the VM's disks to thick eager-zeroed disks.
Virtual-mode Raw Device Mapping (RDM) is supported. Physical-mode RDM is not supported.
With shared disks, FT opens the disks in multi-writer mode (the primary and secondary share the base disk, which is flat, preallocated, and in multi-writer mode). Multi-writer mode doesn't support thin/sparse/lazy-zeroed disks. If we didn't force disks to eager-zeroed thick, there would be a ~30s latency on failovers. Source: Eric Lowe

16 Prerequisites for VMware Fault Tolerance (ctd)
Networking
Multiple gigabit Network Interface Cards (NICs) are required.
A minimum of two VMkernel Gigabit NICs must be dedicated to VMware FT Logging and VMotion.
The FT Logging interface is used for logging events from the primary virtual machine to the secondary FT virtual machine.
For best performance, use a 10Gbit NIC rather than a 1Gbit NIC, and enable jumbo frames.
The VMkernel handles the network traffic for FT through the VMkernel network interface.

17 Prerequisites for VMware Fault Tolerance (ctd)
Processor
SMP virtual machines are not supported.
The primary and secondary virtual machines must run on hosts of the same CPU model family.
Supported processors include the following:
Intel Core 2, also known as Merom
Intel 45nm Core 2, also known as Penryn
Intel Next Generation, also known as Nehalem
AMD 2nd Generation Opteron, also known as Rev E/F common feature set
AMD 3rd Generation Opteron, also known as Greyhound

18 Prerequisites for VMware Fault Tolerance (ctd)
Host BIOS
VMware FT requires that Hardware Virtualization (HV) be turned on in the BIOS. The process for enabling HV varies among BIOSes.
If HV is not enabled, attempts to power on a primary copy of a fault tolerant virtual machine produce the following error message:
"Fault tolerance requires that Record/Replay is enabled for the virtual machine. Module Statelogger power on failed."
Hardware Virtualization is a must for FT to function. Please note the error message that appears when it is not enabled.

19 Prerequisites for VMware Fault Tolerance (ctd)
If HV is enabled on the ESX/ESXi host that is hosting a primary copy of a fault tolerant virtual machine, but not on any other hosts in the cluster, the primary can be successfully powered on.
After the primary is powered on, VMware FT automatically attempts to start the fault tolerant secondary. This fails after a brief delay and produces the following error message:
"Secondary virtual machine could not be powered on as there are no compatible hosts that can accommodate it."
The primary remains powered on in live mode, but fault tolerance is not established.
If the other hosts do not have HV enabled but the primary's host does, the primary VM is still powered on but will produce an error message. Note the last point.

20 Prerequisites for VMware Fault Tolerance (ctd)
Turn off power management (also known as power-capping) in the BIOS. If power management is left enabled, the secondary hosts may enter lower-performance, power-saving modes. Such modes can leave the secondary virtual machine with insufficient CPU resources, potentially making it impossible for the secondary to complete all tasks completed on the primary in a timely fashion.
Turn off hyperthreading in the BIOS. If hyperthreading is left enabled and the secondary virtual machine is sharing a CPU with another demanding virtual machine, the secondary virtual machine may run too slowly to complete all tasks completed on the primary in a timely fashion.
The power-saving mode will slow the secondary machine (see sub-point 1) causing (sub-point 2). Turning off hyperthreading would be considered a best practice to ensure top performance for FT. Considerations would be made for customers who require hyperthreading, and this would be a PSO engagement.
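The host-level prerequisites in this lesson can be collected into a simple checklist function. The sketch below is illustrative only: the `Host` record and its fields are hypothetical and are not a vSphere API; the rules themselves are the ones stated on the slides (different hosts, same CPU family, same build, HA cluster membership, HV enabled, power management and hyperthreading off).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    # Hypothetical inventory record; these fields are not a vSphere API.
    name: str
    cpu_family: str
    build: str
    hv_enabled: bool
    in_ha_cluster: bool
    power_management: bool = False
    hyperthreading: bool = False


def ft_pair_problems(primary: Host, secondary: Host) -> list:
    """Return the prerequisite violations for a candidate primary/secondary host pair."""
    problems = []
    if primary.name == secondary.name:
        problems.append("primary and secondary must run on different hosts")
    if primary.cpu_family != secondary.cpu_family:
        problems.append("hosts must be in the same CPU model family")
    if primary.build != secondary.build:
        problems.append("hosts must run the same ESX build")
    for host in (primary, secondary):
        if not host.in_ha_cluster:
            problems.append(f"{host.name}: not in a VMware HA cluster")
        if not host.hv_enabled:
            problems.append(f"{host.name}: Hardware Virtualization disabled in BIOS")
        if host.power_management:
            problems.append(f"{host.name}: power management should be turned off")
        if host.hyperthreading:
            problems.append(f"{host.name}: hyperthreading should be turned off")
    return problems


a = Host("esx01", "Penryn", "build-A", hv_enabled=True, in_ha_cluster=True)
b = Host("esx02", "Penryn", "build-A", hv_enabled=False, in_ha_cluster=True)
for problem in ft_pair_problems(a, b):
    print(problem)
```

Storage and networking requirements are left out to keep the sketch short; they could be added as further fields in the same way.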


22 Setting Up Fault Tolerance
To enable Fault Tolerance, connect the vSphere Client to the vCenter Server using an account with cluster administrator permissions.
In the Hosts & Clusters view, select a virtual machine, then right-click > Fault Tolerance > Turn Fault Tolerance On.
If the virtual machine is stored on thinly provisioned or lazily zeroed disk(s), those disk files must be converted to thick eager-zeroed before FT can be enabled. When FT is enabled, a message appears informing users of this requirement and of the fact that the conversion will be completed.
The specified virtual machine is marked as a primary and a secondary is established on another host. FT is now enabled.
A screenshot of this warning is on the next slide.

23 Setting Up Fault Tolerance (ctd)

24 Setting Up Fault Tolerance (ctd)
A view of an FT VM in the inventory.


26 Viewing Information about Fault Tolerant VMs
Fault Tolerant VMs have an additional Fault Tolerance pane on their Summary tab which provides information about the Fault Tolerance setup and performance.
Fault Tolerance Status - Indicates the status of fault tolerance: Protected or Not Protected/Disabled.

27 Viewing Information about Fault Tolerant VMs (ctd)
Secondary Location - Displays the ESX/ESXi host on which the secondary virtual machine is hosted.
Total Secondary CPU - Indicates all secondary CPU usage, displayed in MHz.
Total Secondary Memory - Indicates all secondary memory usage, displayed in MB.
Secondary VM Lag Time - Shows the current delay between the primary and secondary VM.
Log Bandwidth - Shows the bandwidth consumed on the link by Record/Replay operations between the primary and secondary VM. This value is based on the FT operations only, and is not the bandwidth usage on the wire (i.e. with TCP/IP/Ethernet headers).
You cannot disable FT from the secondary node.

28 FT Virtual Machine files
[Screenshots: VM files before the VM is FT enabled and after the VM is FT enabled]

29 Maps View of an FT VM
Notice here that both the original VM (along with its host) and the FT VM (along with its host) are shown.


31 VMware FT Restrictions
Many VMware Infrastructure features and third-party products are supported for use with VMware FT, but the following features are not:
Microsoft Cluster Services (MSCS): MSCS does its own failover and management. As a result, conflicts may arise when VMware FT and MSCS solutions coexist.
Nested Page Tables/Extended Page Tables (NPT/EPT): A restriction of the record/replay implementation. This restriction does not affect the user experience. Record/replay automatically disables NPT/EPT for the virtual machines that use it, even though other virtual machines on the same host can continue to use these features.
Paravirtualization: A restriction of the record/replay implementation. Record/replay does not work with paravirtualized guests.
Hot-plugging devices: A restriction of the record/replay implementation. Users cannot hot add or remove devices.
Automatic DRS recommendation application: For this release, an FT virtual machine cannot be used with DRS, though manual VMotion is allowed.
We need to add a description of NPT and EPT. NPT (AMD) is now called Rapid Virtualization Indexing. Read more on this at [URL missing from transcript]. For EPT (Intel), please read more at [URL missing from transcript].

32 Features not supported with VMware FT
Symmetric multiprocessor (SMP) virtual machines.
Storage VMotion.
NPIV (N-Port ID Virtualization).
NIC passthrough.
Devices which do not have Record/Replay support, such as USB and sound.
Some network interfaces for legacy network hardware, such as vlance. While some legacy drivers are not supported, VMware FT does revert to the supported vmxnet2 driver, thereby handling cases where vlance would otherwise be required.
Virtual machine snapshots.
NIC passthrough will be covered in the Networking Module and is initially covered in module 0.

33 Fault Tolerance Best Practices
Ratio of Fault Tolerant VMs to ESX/ESXi hosts
Maintaining consistency between primary and secondary fault tolerant virtual machines makes significant use of disk and network resources.
You should have no more than four to eight fault tolerant virtual machines (primaries or secondaries) on any single host.
The number of fault tolerant virtual machines that you can safely run on each host cannot be stated precisely because it depends on the ESX/ESXi host, the virtual machine size, and workload factors, all of which can vary widely.
What are the configuration maximums at the cluster level rather than the per-host level? (Up to 32 hosts in a cluster [vSphere 4] * 8)
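The cluster-level question in the notes reduces to simple arithmetic. The sketch below assumes the upper guideline of eight FT virtual machines per host and the 32-host cluster size quoted in the notes; because each protected VM needs both a primary and a secondary, the number of protected VMs is half the total. This is a back-of-envelope estimate from the slide's own figures, not a published configuration maximum.

```python
HOSTS_PER_CLUSTER = 32      # vSphere 4 figure quoted in the slide notes
FT_VMS_PER_HOST = 8         # upper end of the "four to eight" guideline

ft_vms_total = HOSTS_PER_CLUSTER * FT_VMS_PER_HOST   # primaries plus secondaries
ft_pairs = ft_vms_total // 2                          # each protected VM needs two
print(ft_vms_total, ft_pairs)
```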

34 Fault Tolerance Use Cases
Several typical situations can benefit from the use of VMware FT. For example:
Any application that needs to be available at all times. This especially applies to applications that have long-lasting client connections that users want to maintain during a hardware failure.
Custom applications that have no other way of doing clustering.
Cases where high availability might be provided through MSCS, but MSCS is too complicated to configure and maintain.


36 Primary vmware.log FT Startup Messages
Mar 04 15:40:41.556: vmx| MigrateStateUpdate: Transitioning from state 0 to 1.
Mar 04 15:40:41.557: vmx| Migrating to become primary
Mar 04 15:40:41.557: vmx| StateLogger_MigrateStart: VMotion srcIp , dstIp
Mar 04 15:40:41.557: vmx| StateLogger_MigrateStart: Logging srcIp , dstIp
Mar 04 15:40:49.538: vmx| VMXVmdbCbVmVmxMigrate: Got SET callback for /vm/#_VMX/vmx/migrateState/cmd/##1_202/op/=start
Mar 04 15:40:49.539: vmx| VmxMigrateGetStartParam: mid= b562 dstwid=4953
Mar 04 15:40:49.539: vmx| Received migrate 'start' request for mig id , dest world id
Mar 04 15:40:49.541: vmx| MigrateStateUpdate: Transitioning from state 1 to 2.
Mar 04 15:40:49.817: vcpu-0| MigrateStateUpdate: Transitioning from state 2 to 3.
Mar 04 15:40:49.818: vcpu-0| Migrate: Preparing to suspend.
Mar 04 15:40:49.819: vcpu-0| Migrating a secondary VM
Mar 04 15:40:49.819: vcpu-0| CPT current = 0, requesting 1
Mar 04 15:40:49.819: vcpu-0| Migrate: VM stun started, waiting 8 seconds for go/no-go message.
...
*** Please note that these slides are taken from multiple different FT VMs, so you will see differing VMotion IDs and FT Logger IDs throughout this section ***
This slide shows the VM powering on and migrating to another host to create the secondary VM. Historically, when a VMotion completed, we deleted the original VM. With FT, we go through all the same steps except that we do not delete the original VM when the migration completes.
VMotion is used on an FT VM when:
a secondary or primary VM is VMotioned to another host;
a failover happens and the secondary VM needs to be respawned with FT VMotion.
Detailed FT operation: https://wiki.eng.vmware.com/PlatformSolutions/RecordReplay/FT/FT_Design_Doc

37 Primary vmware.log FT Startup Messages (ctd)
Mar 04 15:40:49.852: vmx| Migrate_Open: Migrating to < > with migration id
Mar 04 15:40:49.852: vmx| Checkpointed in VMware ESX, build , build , Linux Host
Mar 04 15:40:49.853: vmx| BusMemSample: checkpoint 3 initPercent 75 touched
Mar 04 15:40:49.854: vmx| FT saving on primary to create new backup
Mar 04 15:40:49.889: vmx| Connection accepted, ft id
Mar 04 15:40:49.892: vmx| STATE LOGGING ENABLED (interponly 0 interpbt 0)
Mar 04 15:40:49.893: vmx| LOG data
...
Mar 04 15:40:50.275: vmx| Migrate: VM successfully stunned.
Mar 04 15:40:50.276: vmx| MigrateStateUpdate: Transitioning from state 3 to 4.
Mar 04 15:40:50.890: vmx| MigrateSetStateFinished: type=1 new state=5
Mar 04 15:40:50.890: vmx| MigrateStateUpdate: Transitioning from state 4 to 5.
Mar 04 15:40:50.891: vmx| StateLogger_MigrateSucceeded: Backup connected
Mar 04 15:40:50.891: vmx| Migrate: Attempting to continue running on the source.
Mar 04 15:40:50.893: vmx| CPT current = 3, requesting 6
...
Mar 04 15:40:50.915: vmx| Continue sync while logging or replaying 8428
Mar 04 15:40:50.924: vmx| Migrate: cleaning up migration state.
Mar 04 15:40:50.924: vmx| MigrateStateUpdate: Transitioning from state 5 to 0.
Statelogger notes that the migration succeeded. The VM unstuns and continues to run on the source. The log also prints the 'ft id', which can be matched to the 'ft id' in the secondary's vmware.log file and used to identify which stateLogger vmkernel messages correspond to this FT pair (again, assuming there is more than one FT pair enabled).

38 Migration Transition States - Primary
MIGRATE_VMX_NONE (state 0): Base state. No migration currently in progress.
MIGRATE_TO_VMX_READY (state 1): VMX has received a MIGRATE_TO message. Waiting for the start message along with the world ID of the destination.
MIGRATE_TO_VMX_PRECOPY (state 2): VMX has received a MIGRATE_START message. Precopying data to destination.
MIGRATE_TO_VMX_CHECKPT (state 3): Precopy done. Saving checkpoint.
MIGRATE_TO_VMX_WAIT_HANDSHAKE (state 4): Done saving checkpoint. Waiting for acknowledgement from destination that the VMX started. Until the acknowledgement is received, the migration may still fail back to the source.
MIGRATE_TO_VMX_FINISHED (state 5): Migration succeeded or failed. On success, the VMX process needs to power down and clean up. On failure, the VM will continue running and be ready for the next migration operation after this state passes.
These are taken from an excerpt of the code: typedef enum MigrateVmxState

39 Migration Transition States - Secondary
MIGRATE_VMX_NONE (state 0): Base state. No migration currently in progress.
MIGRATE_FROM_VMX_INIT (state 7): VMX has received a MIGRATE_FROM message. Getting ready to receive VM.
MIGRATE_FROM_VMX_WAITING (state 8): VMX is ready and waiting for source to send VM data.
MIGRATE_FROM_VMX_PRECOPY (state 9): Both memory and checkpoint data are being copied to destination.
MIGRATE_FROM_VMX_CHECKPT (state 10): Data was precopied. Restoring checkpoint.
MIGRATE_FROM_VMX_FINISHED (state 11): Migration succeeded or failed. On success, the VMX process runs the migrated VM. After this state passes, VMX is ready for the next migration operation. On failure, the VM will power down and clean up.
State 6 is unused.
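When reading a vmware.log, it can help to translate the numeric MigrateStateUpdate transitions into the state names listed on these two slides. The following is a minimal sketch: the state table is copied from the slides, while the regular expression is an assumption inferred from the sample log lines in this lesson and may need adjusting for other builds. `decode_transitions` is a hypothetical helper name.

```python
import re

# State numbers and names as listed on the two slides above (state 6 is unused).
MIGRATE_STATES = {
    0: "MIGRATE_VMX_NONE",
    1: "MIGRATE_TO_VMX_READY",
    2: "MIGRATE_TO_VMX_PRECOPY",
    3: "MIGRATE_TO_VMX_CHECKPT",
    4: "MIGRATE_TO_VMX_WAIT_HANDSHAKE",
    5: "MIGRATE_TO_VMX_FINISHED",
    7: "MIGRATE_FROM_VMX_INIT",
    8: "MIGRATE_FROM_VMX_WAITING",
    9: "MIGRATE_FROM_VMX_PRECOPY",
    10: "MIGRATE_FROM_VMX_CHECKPT",
    11: "MIGRATE_FROM_VMX_FINISHED",
}

# Line layout inferred from the sample vmware.log excerpts in this lesson.
TRANSITION = re.compile(
    r"^(?P<ts>\w{3} \d{2} [\d:.]+): (?P<thread>[\w-]+)\| "
    r"MigrateStateUpdate: Transitioning from state (?P<old>\d+) to (?P<new>\d+)\."
)


def decode_transitions(lines):
    """Yield (timestamp, old_name, new_name) for each MigrateStateUpdate line."""
    for line in lines:
        m = TRANSITION.match(line)
        if m:
            old, new = int(m["old"]), int(m["new"])
            yield (m["ts"],
                   MIGRATE_STATES.get(old, f"UNKNOWN_{old}"),
                   MIGRATE_STATES.get(new, f"UNKNOWN_{new}"))


sample = [
    "Mar 04 15:40:41.556: vmx| MigrateStateUpdate: Transitioning from state 0 to 1.",
    "Mar 04 15:40:41.557: vmx| Migrating to become primary",
    "Mar 04 15:40:49.817: vcpu-0| MigrateStateUpdate: Transitioning from state 2 to 3.",
]
for ts, old, new in decode_transitions(sample):
    print(f"{ts}  {old} -> {new}")
```

Lines that are not state transitions are skipped, so the function can be fed a whole log file.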

40 FT Troubleshooting – Primary vmkernel logs
Immediately following the FT migration you will see messages like these on the ESX host. You will want to note the migration ID and the statelogger ID in the case where there are many FT VMs.
Primary:
Mar  4 10:51:35 prme-stft053 vmkernel: 0:16:24: cpu2:4281)VMotion: 2582: S: Stopping pre-copy: only pages were modified, which can be sent within the switchover time goal of seconds (network bandwidth ~ MB/s)
Mar  4 10:51:35 prme-stft053 vmkernel: 0:16:24: cpu3:4280)VSCSI: 5850: handle 8193(vscsi0:0):Destroying Device for world 4281 (pendCom 0)
Mar  4 10:51:36 prme-stft053 vmkernel: 0:16:24: cpu7:4230)VMKStateLogger: 6856: : accepting connection from secondary at
Note the S against the VMotion ID: S for source.

41 FT Troubleshooting – Secondary vmkernel logs
Mar  4 10:51:34 prme-stft057 vmkernel: 0:19:53: cpu2:4286)VMotion: 1805: D: Set ip address ' ' worldlet affinity to recv World ID 4289
Mar  4 10:51:34 prme-stft057 vmkernel: 0:19:53: cpu7:4228)MigrateNet: vm 4228: 1096: Accepted connection from < >
Mar  4 10:51:34 prme-stft057 vmkernel: 0:19:53: cpu7:4228)MigrateNet: vm 4228: 1110: dataSocket 0x4100b6092e60 send buffer size is
Mar  4 10:51:35 prme-stft057 vmkernel: 0:19:53: cpu3:4289)VMotionRecv: 226: D: Estimated network bandwidth MB/s during pre-copy
Mar  4 10:51:36 prme-stft057 vmkernel: 0:19:53: cpu7:4286)VSCSI: 3469: handle 8193(vscsi0:0):Creating Virtual Device for world 4287 (FSS handle )
Mar  4 10:51:36 prme-stft057 vmkernel: 0:19:53: cpu7:4286)VMKStateLogger: 1949: :  Connected to primary
Note the D against the VMotion ID: D for destination.
The statelogger ID for each FT pair is derived from the VMotion migration ID.
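When many FT VMs share a host, it can be useful to split vmkernel lines into fields and pick out the S (source) and D (destination) markers mechanically. The sketch below is an assumption-laden illustration: the regular expressions are inferred only from the sample lines shown in this lesson (in which the migration and statelogger IDs have been stripped), and `parse_vmkernel` is a hypothetical helper, not a VMware tool.

```python
import re
from typing import Optional

# Layout inferred from the sample lines in this lesson; other builds may differ.
VMKERNEL_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ [\d:]+) (?P<host>\S+) vmkernel: \S+ "
    r"cpu\d+:(?P<world>\d+)\)(?P<module>\w+): (?P<message>.*)$"
)
VMOTION_ROLE = re.compile(r"^\d+: (?P<role>[SD]): ")


def parse_vmkernel(line: str) -> Optional[dict]:
    """Split one vmkernel line into stamp, host, world, module, role, and message."""
    m = VMKERNEL_LINE.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    role = VMOTION_ROLE.match(entry["message"])
    # S marks the VMotion source (primary side), D the destination (secondary side).
    entry["role"] = {"S": "source", "D": "destination"}[role["role"]] if role else None
    return entry


sample = (
    "Mar  4 10:51:35 prme-stft057 vmkernel: 0:19:53: cpu3:4289)VMotionRecv: 226: "
    "D: Estimated network bandwidth MB/s during pre-copy"
)
entry = parse_vmkernel(sample)
print(entry["host"], entry["module"], entry["role"])
```

Lines from modules that carry no S/D marker, such as VMKStateLogger, are returned with `role` set to `None`.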

42 FT Troubleshooting – vmware.log
The FT pair ID (logged from the StateLogger vmkernel module to identify the FT pair) is also found in the vmware.log file. This is an example of a secondary whose primary died:
Mar 03 20:03:56.457: vcpu-0| StateLoggerSetEndOfLog: BCnt: fSz: 0 bufPos
...
Mar 03 20:03:56.464: vmx| Preparing to go live
...
Mar 03 20:03:56.503: vmx| Done going live
Mar 03 20:03:56.503: vmx| Failover initiated via vmdb
Mar 03 20:03:56.504: vmx| Gone live because of Lost connection to primary.
Mar 03 20:03:56.506: vmx| Unstunning after golive
...
Mar 03 20:04:08.199: vmx| FT saving on primary to create new backup
Mar 03 20:04:08.203: vmx| Connection accepted, ft id
The FT pair ID will be needed to differentiate the messages when troubleshooting customers with multiple FT-enabled VMs on the hosts.
Noteworthy events:
1) Statelogger reaches end of log
2) Go Live events
3) Migration to create new secondary (left out of snip)
4) The new FT ID for the newly established FT pair

43 FT Troubleshooting – Split Brain
For support purposes, the vmkernel log files will display messages similar to the following on the host running the VM that lost the race for the generation file (and thus did not golive):
Mar  4 10:52:45 prme-stft057 vmkernel: 0:19:54: cpu2:4291)VMKStateLogger: 7823: Rename of .ft-generation2 to .ft-generation3 failed: Not found
Mar  4 10:52:45 prme-stft057 vmkernel: 0:19:54: cpu2:4291)VMKStateLogger: 2792: : Can *NOT* golive
On the host running the VM that did win the race and successfully renamed the file (and did golive) you will see a corresponding message:
Mar  4 10:52:45 prme-stft053 vmkernel: 0:16:25: cpu6:4283)VMKStateLogger: 2792: : Can golive
The other thing you'll want to note is the statelogger ID if there are multiple FT-enabled VMs.
The file rename is an atomic operation supported by VMFS, and only one issuer can succeed when both try to rename a file, say from "N" to "N+1". To see who won the race, you will need to check the VC GUI to see on which host the VM is alive. The common file does not give any information on that in the current implementation.
From https://wiki.eng.vmware.com/PlatformSolutions/RecordReplay/FT/FT_Design_Doc Section
"In order to ensure that we don't have a primary and backup simultaneously trying to be the primary during a network partition, we use an on-disk FT generation number file. Before a primary launches the first backup it creates the generation number file generation.N. When a backup connects to the primary the primary sends it generation number N. Once the primary or backup determines that a failure of the other half of the FT pair, it will try to rename the file from generation.N to generation.N+1. If the rename succeeds, then it can take the appropriate failure recovery action and become or remain the primary VM. If the rename fails, then the other VM must have already done recovery, so the current VM commits suicide.
There is no possibility of disk corruption before a network partition is discovered, because the primary cannot write to the disk without receiving an ACK from the backup."
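The generation-file race can be imitated on an ordinary filesystem. The sketch below is not the VMkernel implementation: a local temporary directory stands in for the shared VMFS volume, two threads stand in for the primary and backup hosts, and `try_go_live` is a hypothetical helper. It only demonstrates why an atomic rename from generation.N to generation.N+1 leaves exactly one winner.

```python
import os
import tempfile
import threading


def try_go_live(directory: str, generation: int) -> bool:
    """Attempt the generation.N -> generation.N+1 rename; True means this side may go live."""
    src = os.path.join(directory, f"generation.{generation}")
    dst = os.path.join(directory, f"generation.{generation + 1}")
    try:
        os.rename(src, dst)
        return True
    except FileNotFoundError:
        # The other side already renamed the file, so this side must stop.
        return False


with tempfile.TemporaryDirectory() as shared:
    # The primary creates generation.N before launching the first backup.
    open(os.path.join(shared, "generation.2"), "w").close()

    results = {}

    def contender(name: str) -> None:
        results[name] = try_go_live(shared, 2)

    threads = [threading.Thread(target=contender, args=(n,)) for n in ("primary", "backup")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    winners = [name for name, won in results.items() if won]
    print(winners)
```

Which side wins varies from run to run, but the list always contains a single name; the loser sees the same "not found" condition reported in the vmkernel log above.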

44 VMware SiteSurvey Tool
We have created a new utility which analyzes a cluster of ESX hosts and tells you whether the configuration is suitable for FT. This includes checking for FT-compatible processors, shared storage, BIOS settings, etc. The utility is called VMware SiteSurvey and a Beta copy is available in the "Documents" tab. To use it, download the VMware SiteSurvey executable from that page and run it, which will install the utility on your local Windows machine. Detail at [URL missing from transcript]

45 VMware SiteSurvey Tool (ctd)
This is a sample report output.

46 Troubleshooting Fault Tolerance
When attempting to power on a virtual machine with VMware FT enabled, an error message may appear in a pop-up dialog box:
"Fault tolerance requires that Record/Replay is enabled for the virtual machine. Module Statelogger power on failed."
What is a possible root cause?
This is often the result of Hardware Virtualization (HV) not being available on the ESX/ESXi server on which you are attempting to power on the virtual machine. HV may not be available either because it is not supported by the ESX/ESXi server hardware or because HV is not enabled.

47 Troubleshooting Fault Tolerance (ctd)
After powering on a virtual machine with VMware FT enabled, an error message may appear in the Recent Tasks pane:
"Secondary virtual machine could not be powered on as there are no compatible hosts that can accommodate it."
What is a possible root cause?
This is typically because there are no other hosts in the cluster or there are no other hosts with HV enabled. If there are insufficient hosts, add more hosts to the cluster or accept the lack of fault tolerance. If there are hosts in the cluster, ensure they support HV and that HV is enabled.
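The two error messages covered so far map naturally onto a lookup table. The sketch below is illustrative: the message text is copied from the slides and the causes and actions summarize the slide text, while `FT_ERRORS` and `diagnose` are hypothetical names rather than part of any VMware tool.

```python
# Error text copied from the slides; causes and actions summarize the slide text.
FT_ERRORS = {
    "Fault tolerance requires that Record/Replay is enabled for the virtual machine. "
    "Module Statelogger power on failed.": (
        "Hardware Virtualization (HV) is unsupported or not enabled on the host "
        "powering on the virtual machine",
        "Enable HV in the host BIOS, or use a host whose hardware supports it",
    ),
    "Secondary virtual machine could not be powered on as there are no compatible "
    "hosts that can accommodate it.": (
        "No other host in the cluster, or no other host with HV enabled",
        "Add hosts to the cluster, or ensure the other hosts support HV and have it enabled",
    ),
}


def diagnose(message: str) -> tuple:
    """Return (probable cause, suggested action) for a known FT error message."""
    # Collapse whitespace and drop surrounding straight or curly quotes.
    normalized = " ".join(message.split()).strip('"“”')
    return FT_ERRORS.get(normalized, ("Unknown error", "No guidance on the slides"))


cause, action = diagnose(
    '"Secondary virtual machine could not be powered on as there are '
    'no compatible hosts that can accommodate it."'
)
print(cause)
print(action)
```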

48 Troubleshooting Fault Tolerance (ctd)
When selecting a VM to enable Fault Tolerance, you find that the ‘Turn on Fault Tolerance’ option is greyed out. What are the possible causes?
The host on which the virtual machine resides is not part of a VMware HA cluster.
The host on which the virtual machine resides does not have Hardware Virtualization turned on in the BIOS for the CPUs.
The virtual machine does not support VMware Fault Tolerance. Update the virtual machine to a more recent version.
The virtual machine has snapshots. Delete any snapshots.

49 Lesson 2-3 Summary
vSphere 4.0 introduces a new concept called Fault Tolerance. This enhances the VM availability that we had with VMware HA, in that there is no downtime on the VM when a hardware failure occurs on the ESX host. However, in this initial release, there are a number of restrictions placed on the configuration of a VM that is to use FT.

50 Lesson 2-3 - Lab 1
Lab 1 involves creating Fault Tolerant VMs:
Create a Fault Tolerant VM
Watch a Fault Tolerant VM fail over to another host
Fault Tolerant VM settings

