Module 7: VPLEX Troubleshooting

1 Module 7: VPLEX Troubleshooting
Define data collection processes Perform basic troubleshooting on various VPLEX elements Collect and analyze data to resolve connectivity errors This module focuses on troubleshooting a VPLEX environment.

2 Module 7: VPLEX Troubleshooting
Lesson 1: Data Collection Use logs and debugging tools to troubleshoot VPLEX Capture and analyze VPLEX log and configuration files This lesson covers collecting VPLEX troubleshooting data.

3 VPLEX Logs, Cores, and Configuration Data
VPlexcli:/> collect-diagnostics :23:32 UTC: ****Initializing collect-diagnostics... :23:34 UTC: No cluster-witness server found. :23:34 UTC: Free space = 43.32GB :23:34 UTC: Total space needed = 5.43GB :23:34 UTC: ****Starting collect-diagnostics, this operation might take a while... :23:36 UTC: ****Starting Remote Management server sms dump log collection, this operation might take a while... Initiating sms dump... No such file or directory: /var/log/allmessages No such file or directory: /var/log/rsyncd.log No such file or directory: /var/log/syslog Warning: Could not access source directory /opt/emc/VPlex/Event_Msg_Folder No such file or directory: /etc/snmp/VplexSnmpPerfConfig.xml :24:10 UTC: ****Executing debugTowerDump on directors... :24:16 UTC: ****Completed debugTowerDump on directors. :24:16 UTC: ****Executing appdump on directors... :24:26 UTC: ****Completed appdump on directors. :24:26 UTC: ****Executing connectivity director on directors... :24:27 UTC: ****Completed connectivity director on directors. :24:27 UTC: ****Executing getsysinfo on directors... :24:32 UTC: ****Completed getsysinfo on directors. :24:33 UTC: ****Executing command export storage-view show-powerpath-interfaces --verbose on clusters... :24:33 UTC: ****Completed command export storage-view show-powerpath-interfaces --verbose on clusters. The collect-diagnostics command collects logs, cores, and configuration information from the Management Server and directors. This command produces a tar.gz file. Module 7 - VPLEX Troubleshooting Module 7 - VPLEX Troubleshooting

4 collect-diagnostics Collects VPLEX log files
Places logs in: /diag/collect-diagnostics-out Extended Log: <tla>-diagnostics-extended-<time-stamp>.tar.gz Base Log: <tla>-diagnostics-<time-stamp>.tar.gz Collects all VPLEX logs including logs from the VPLEX Witness VPlexcli:/> collect-diagnostics :14:55 UTC: ****Initializing collect-diagnostics... :14:55 UTC: No cluster-witness server found. :14:55 UTC: Free space = K :14:55 UTC: Total space needed = K :14:55 UTC: ****Starting collect-diagnostics, this operation might take a while... Thread-dump written to /diag/collect-diagnostics-tmp/debug-thread-output.txt Initiating sms dump... No such file or directory: /var/log/allmessages No such file or directory: /var/log/rsyncd.log No such file or directory: /var/log/syslog No such file or directory: /home/service/.b The collect-diagnostics command collects all of the logs for the VPLEX Management Server, directors, and now with v5.0 the VPLEX Witness log files. In 5.0 it also collects cache vaulting information via the vault dump command. When run, it creates two log files, one extended and one base. Customers should collect both files. Since the extended file is expected to be very large, the base collect-diagnostics file should be transferred to support first while waiting for the extended file to transfer. The log files are retrieved from all directors in a VPLEX Metro or Geo unless the --local option is used. The extended log file contains a Java heap dump, a fast trace dump, cores (if they exist), and performance sink files (if they exist); the base log file contains everything else. Please note: do not run the collect-diagnostics command simultaneously on both Management Servers.
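As a quick reference, the collection steps above can be condensed into the following hedged sketch. It assumes the service account on the Management Server and uses only the options described in this module; verify the exact syntax against your GeoSynchrony release.

VPlexcli:/> collect-diagnostics --local
VPlexcli:/> exit
Management-server-2:~> ls -l /diag/collect-diagnostics-out/

The --local option (described above) limits collection to the local cluster. Transfer the smaller base tar.gz to support first, then the extended archive.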

5 VPLEX Configuration Data
VPlexcli:/> cluster configdump -c cluster-1 -f /var/log/VPlex/cli/configdump_9_15_12.xml Initiating cluster configdump... VPlexcli:/> The cluster configdump command creates a dump file of the configuration information for a particular cluster. This configuration data consists of I/O port configurations and disk information, including the paths from the directors to each storage volume. It also contains device configuration and capacity data, volume configuration, initiators, storage view configuration, and system volume information.

6 Diagnostic TAR Location
cd /diag/collect-diagnostics-out/ ll total -rw-r--r-- 1 service users :15 FNM diagnostics tar.gz -rw-r--r-- 1 service users :14 FNM diagnostics-extended tar.gz The compressed TAR files are located in the /diag/collect-diagnostics-out folder.
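If the archives need to be pulled off the Management Server for upload to support, a standard scp copy from a workstation is one option. This is a hedged sketch in which the host name and file names are placeholders, not values from this module:

scp service@<management-server>:/diag/collect-diagnostics-out/<tla>-diagnostics-<time-stamp>.tar.gz .
scp service@<management-server>:/diag/collect-diagnostics-out/<tla>-diagnostics-extended-<time-stamp>.tar.gz .

Copy the base archive first, since the extended archive is much larger.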

7 collect-diagnostics.log :27:54 UTC: ****Executing cluster status... {'MALVPLEX02': 'operational-status': ok, 'health-state': ok, 'transition-indications': [], 'transition-progress': [], 'health-indications': []}, 'MALVPLEX01': 'health-indications': []} } :27:54 UTC: ****Completed cluster status. The collect-diagnostics log is useful to determine that the collections completed successfully. You also have the core results of the diagnostic collection shown in this file.

8 collect-diagnostics.log :27:54 UTC: ****Executing cluster summary... Clusters: Name Cluster ID Connected Expelled Operational Status Health State MALVPLEX true false ok ok MALVPLEX true false ok ok Islands: Island ID Clusters MALVPLEX01, MALVPLEX02 :27:54 UTC: ****Completed cluster summary. Here we see the collection of the cluster summary. The cluster is correctly operating at this time.

9 collect-diagnostics.log /engines/engine-1-1/directors/director-1-1-A/hardware/ports: Name Address Role Port Status A0-FC00 0x af700 front-end up A0-FC01 0x af701 front-end up A0-FC02 0x af702 front-end up A0-FC03 0x af703 front-end up A1-FC00 0x af710 back-end up A1-FC01 0x af711 back-end up A1-FC02 0x af712 back-end up A1-FC03 0x af713 back-end up A2-FC00 0x af720 wan-com up A2-FC01 0x af721 wan-com up A2-FC02 0x af722 wan-com no-link A2-FC03 0x af723 wan-com no-link A3-FC00 0x af730 local-com up A3-FC01 0x af731 local-com up A3-FC02 0x down A3-FC03 0x down Review the port status and ensure the ports are up and operating correctly. Module 7 - VPLEX Troubleshooting Module 7 - VPLEX Troubleshooting

10 Health-Check Overview
VPlexcli:/> health-check Product Version: Clusters: Cluster Cluster Oper Health Connected Expelled Name ID State State cluster ok ok True False cluster ok ok True False Meta Data: Cluster Volume Volume Oper Health Active Name Name Type State State cluster-1 Pod7Meta_backup_2012Aug29_ meta-volume ok ok False cluster-1 logging_Pod09Logging_vol logging-volume ok ok cluster-1 Pod09Meta meta-volume ok ok True cluster-1 Pod7Meta_backup_2012Aug30_ meta-volume ok ok False cluster-1 Pod09Meta_backup_2012Sep07_ meta-volume ok ok False cluster-1 Pod09Meta_backup_2012Sep08_ meta-volume ok ok False The new health-check command conducts a high-level scan of the cluster and displays health status and configuration information. Administrators can quickly obtain a synopsis of the current health state of a cluster after receiving a call-home alert. The command consolidates several other VPLEX CLI status and summary commands into one view, such as the version, cluster status, cluster summary, ds summary, storage-volume summary, export storage-volume summary, virtual-volume summary, connectivity validate-be, connectivity validate-fe, and ll /clusters/**/system-volumes commands. Metrics are displayed for clusters, metadata, front end, and storage. The health-check command is useful in root-cause analysis (RCA) as well as in determining the severity of an issue and its effect on the overall health of the cluster. A typical scenario under which the health-check command should be used is upon receipt of a call-home event. The call-home event gives an event code and a short description. An administrator could use the health-check command as one of the tools to determine RCA and the overall impact after receiving this call-home notification. Health-check is a good starting point, but may not be conclusive.
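If the health-check output needs to be cross-checked, the individual commands that the text above says it consolidates can be run one at a time. A hedged sketch (command names are taken from the list above; output is omitted):

VPlexcli:/> version
VPlexcli:/> cluster status
VPlexcli:/> cluster summary
VPlexcli:/> ds summary
VPlexcli:/> storage-volume summary
VPlexcli:/> virtual-volume summary
VPlexcli:/> connectivity validate-be
VPlexcli:/> connectivity validate-fe
VPlexcli:/> ll /clusters/**/system-volumes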

11 Validate the VPLEX Back-end
VPlexcli:/> connectivity validate-be Summary Cluster cluster-1 0 storage-volumes which are dead or unreachable. 0 storage-volumes which do not meet the high availability requirement for storage volume paths*. 0 storage-volumes which are not visible from all directors. 0 storage-volumes which have more than supported (4) active paths from same director. *To meet the high availability requirement for storage volume paths each storage volume must be accessible from each of the directors through 2 or more VPlex backend ports, and 2 or more Array target ports, and there should be 2 or more ITLs. Cluster cluster-2 0 storage-volumes which have more than supported (4) active paths from same The connectivity validate-be command will verify the back-end connectivity of all VPLEX directors. It ensures that all directors have at least two paths to each storage volume. Module 7 - VPLEX Troubleshooting Module 7 - VPLEX Troubleshooting

12 Validate the VPLEX System
VPlexcli:/> validate-system-configuration Validate cache replication Checking cluster cluster-1 ... rmg component not found skipping the validation of cache replication. ok Validate logging volume No errors found Validate back-end connectivity Cluster cluster-1 0 storage-volumes which are dead or unreachable. 0 storage-volumes which do not meet the high availability requirement for storage volume paths*. 0 storage-volumes which are not visible from all directors. 0 storage-volumes which have more than supported (4) active paths from same director. *To meet the high availability requirement for storage volume paths each storage The validate-system-configuration command performs some basic system checks to ensure that the cluster is working properly. Module 7 - VPLEX Troubleshooting Module 7 - VPLEX Troubleshooting

13 Module 7: VPLEX Troubleshooting
Lesson 1: Summary During this lesson the following topics were covered: How to use logs and debugging tools to troubleshoot VPLEX How to capture and analyze VPLEX log and configuration files This lesson covered how to collect VPLEX troubleshooting data.

14 Module 7: VPLEX Troubleshooting
Lesson 2: VPLEX System Volumes Analyze and correct VPLEX system volume failures List the components and steps to restore a Meta volume This lesson covers troubleshooting VPLEX system volumes.

15 Recovery After a Metadata Volume Failure
VPlexcli:/clusters/cluster-1/storage-elements/storage-volumes> meta-volume backup VPD83T3: ee5c628f984ae011 VPlexcli:/clusters/cluster-1/storage-elements/storage-volumes> ll /**/system-volumes /clusters/cluster-1/system-volumes: Name Volume Type Operational Health Active Ready Geometry Block Block Capacity Slots Status State Count Size cluster-1-log_vol logging-volume ok ok raid K G pod3_meta meta-volume ok ok true true raid K G pod3_meta_backup_2011Mar10_ meta-volume ok ok false true raid K G VPlexcli:/clusters/cluster-1/storage-elements/storage-volumes> meta-volume move -t pod3_meta_backup_2011Mar10_160622 pod3_meta meta-volume ok ok false true raid K G pod3_meta_backup_2011Mar10_ meta-volume ok ok true true raid K G VPlexcli:/clusters/cluster-1/storage-elements/storage-volumes> meta-volume destroy pod3_meta Meta-volume 'pod3_meta' will be destroyed. Do you wish to continue? (Yes/No) y If the VPLEX loses access to its metadata volume, the system will not allow any configuration changes to occur. If the metadata volume cannot be recovered, but the system is still running, a new metadata volume can and should be created immediately using the meta-volume backup command. The new metadata volume must be the same size or larger; EMC currently recommends a 78 GB device for the metadata volume. The meta-volume move command makes the backed-up metadata volume the active metadata volume. The old metadata volume should then be destroyed; if a full reboot occurs, the system could otherwise start using the old metadata volume. If the metadata volume cannot be recovered, but the system is still running: Create a backup metadata volume. Make the backup metadata volume the active metadata volume. Destroy the old metadata volume.
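Pulling the commands from the output above into a single hedged sequence (the storage-volume and meta-volume names are illustrative placeholders; confirm the backup naming convention on your system before destroying anything):

VPlexcli:/> meta-volume backup <storage-volume-for-new-meta>
VPlexcli:/> ll /**/system-volumes
VPlexcli:/> meta-volume move -t <backup-meta-volume-name>
VPlexcli:/> meta-volume destroy <old-meta-volume-name>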

16 Recovery After a Metadata Volume Failure
Coinciding with a system failure, the metadata volume fails and cannot be recovered Restart the cluster with a backup metavolume Activate a backup copy of the metadata volume you created prior to the failure meta-volume activate <target volume> VPlexcli:/clusters/cluster-1/storage-elements/storage-volumes> meta-volume activate -t pod3_meta_backup_2011Mar10_ -f The system metadata will be restored with metadata backup in the target meta-volume. All metadata collected after the backup will be discarded. Continue? (Yes/No) Yes If a VPLEX cluster fails and the metadata volume cannot be recovered, the cluster should be restarted with a backup metadata volume. Activate a backup copy of the metadata volume created prior to the failure. It is good practice to create backups on a regular basis; any configuration changes made since the last backup will be lost.

17 Replace an Array: List the Metadata Volume Components
[Diagram: the metadata volume mirrored across legs on the old array and the new array] VPlexcli:/clusters/cluster-1/system-volumes/c1_meta/components> ll Name Slot Type Operational Health Capacity Number Status State VPD83T3: f01e006af7433c574de storage-volume ok ok G VPD83T3: d storage-volume ok ok G There will come a time when one of the arrays needs to be replaced with a new array. This means that all of the volumes will need to be migrated over to the new array. This process is fairly easy when going from a Symmetrix array to a new Symmetrix VMAX array, as the administrator can use the Symmetrix VMAX Federated Live Migration (FLM) feature to move all of the storage volumes non-disruptively from the older Symmetrix array to the new Symmetrix VMAX array. This is non-disruptive to VPLEX, as VPLEX sees the same WWNs and storage volumes during the operation as it did previously. For more information on VMAX FLM, please see the EMC Solutions Enabler Symmetrix Migration CLI guide on Powerlink. Customers can also use VPLEX Mobility to move volumes from one array to the other array if desired. If the migration is not from a Symmetrix array to a Symmetrix VMAX array, the user should use VPLEX Mobility to move the user data volumes to the new array. However, this still leaves the VPLEX system volumes (metadata and logging). Listed here are the current legs (storage volumes) of the VPLEX metadata volume.

18 Replace an Array: Attach a New Metadata Volume Mirror
[Diagram: a mirror leg on the new array attached to the existing metadata volume] VPlexcli:/> meta-volume attach-mirror -v c1_meta -d VPD83T3: f01e00ee de111 VPlexcli:/clusters/cluster-1/system-volumes/c1_meta/components> rebuild status [1] storage_volumes marked for rebuild Global rebuilds: No active global rebuilds. cluster-1 local rebuilds device rebuild type rebuilder director rebuilt/total percent finished throughput ETA c1_meta full s1_48d6_spa G/78G % 22M/s 56.4 min Once the user volumes have been moved over to the new array, the administrator can use the meta-volume attach-mirror command to attach another mirror to the existing VPLEX Metadata volume. This will cause a rebuild as the VPLEX synchronizes the Metadata volume mirror. This process is non-disruptive to the user. The rebuild status command can be used to display the status of the rebuild.

19 Replace an Array: Rebuild the New Metadata Volume Mirror
VPlexcli:/clusters/cluster-1/system-volumes/c1_meta/components> ll Name Slot Type Operational Health Number Status State VPD83T3: f01e006af7433c574de storage-volume ok ok VPD83T3: f01e00ee de storage-volume error critical-failure VPD83T3: d storage-volume ok ok VPD83T3: f01e00ee de storage-volume ok ok The new mirror will be listed in the components section of the VPLEX Metadata volume. The mirror will be displayed in an error state until it is completely synchronized with the other two mirrors. Once the mirror is fully synchronized its health state and operational state will display ‘OK’.

20 Replace an Array: Remove the Old Metadata Volume Mirror
New Array VPlexcli:/clusters/cluster-1/system-volumes/c1_meta/components> ll Name Slot Type Operational Health Capacity Number Status State VPD83T3: f01e006af7433c574de storage-volume ok ok G VPD83T3: f01e00ee de storage-volume ok ok G Once the new mirror has been synchronized, the old mirror can be removed by running the meta-volume detach-mirror command. The components directory should now be updated with the new volumes. Shown here, the old leg has been removed.
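The detach step itself is not shown in the output above. A hedged sketch, assuming meta-volume detach-mirror accepts the same -v (meta-volume) and -d (storage-volume) arguments as the attach-mirror command shown earlier, with a placeholder VPD83T3 identifier for the leg on the old array:

VPlexcli:/> meta-volume detach-mirror -v c1_meta -d VPD83T3:<old-array-storage-volume>

Verify the exact argument names with the CLI help before running it.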

21 Replace an Array: Migrate the Logging Volumes
[Diagram: a new mirrored logging volume spanning the old array and the new array] Moving a mirrored logging volume to a different array differs slightly from moving a VPLEX Metadata volume. To migrate a mirrored logging volume to a new array, a new mirrored logging volume should first be created between the new array and the existing array. Once the new logging volume has been created, the distributed devices need to be associated with it. This is accomplished by running the ds dd set-log command. Once the distributed devices are associated with the new logging volume, the old logging volume can be destroyed by running the logging-volume destroy command. VPlexcli:/clusters/cluster-1/system-volumes> ds dd set-log -d dd -l New-log_vol logging-volume destroy log-1_vol
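An end-to-end hedged sketch of the logging-volume migration described above; the create step is summarized only in a comment because its exact arguments are not shown in this module, and the distributed-device name is a placeholder (the volume names come from the slide):

(1) Create a new RAID-1 logging volume with one leg on the existing array and one leg on the new array (logging-volume create, per the product documentation).
VPlexcli:/> ds dd set-log -d <distributed-device> -l New-log_vol
VPlexcli:/> logging-volume destroy log-1_vol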

22 Replace an Array: Destroy a Logging Volume
VPlexcli:/> logging-volume destroy New-log_vol/ logging-volume destroy: Evaluation of <<logging-volume destroy New-log_vol/>> failed. cause: Failed to destroy logging-volume ‘New-log_vol’. cause: Unable to destroy logging-volume ‘New-log_vol’. cause: Firmware command error cause: One or more controllers has the virtual volume of the device at A logging volume that is associated with a distributed device cannot be destroyed. Here the administrator tried to destroy a logging volume that was associated with a distributed device and was not successful.

23 Module 7: VPLEX Troubleshooting
Lesson 2: Summary During this lesson the following topics were covered: How to analyze and correct VPLEX system volume failures How to list the components and steps to restore a Meta volume This lesson covered how to troubleshoot system volumes.

24 Module 7: VPLEX Troubleshooting
Lesson 3: VPLEX Connectivity List the connectivity components in VPLEX Define the logs and commands used to troubleshoot VPLEX connectivity Perform basic troubleshooting operations to determine root cause and correct the problem This lesson covers VPLEX connectivity issues.

25 Latency Between Directors: director ping
VPN Connection VPlexcli:/> director ping -n director-1-1-A -i Round-trip time to : 0.671 ms VPlexcli:/> director ping -n director-1-1-A -i Round-trip time to : 0.114 ms VPlexcli:/> director ping -n director-1-1-A -i Round-trip time to : 0.84 ms VPlexcli:/> director ping -n director-1-1-B -i Round-trip time to : ms A VPLEX Metro has a latency requirement of < 5 ms round trip and a VPLEX Geo has a latency requirement of < 50 ms. If the administrator wishes to test the round-trip latency between the clusters after installation, he or she can run the director ping command. The director ping command allows the user to specify the director they wish to ping from and the IP address of the target director they wish to ping.
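To check every path, the same command can be repeated from each local director against each remote director address. A hedged sketch in which the target IPs are placeholders (the actual addresses in the example above were elided):

VPlexcli:/> director ping -n director-1-1-A -i <remote-director-A-IP>
VPlexcli:/> director ping -n director-1-1-B -i <remote-director-A-IP>
VPlexcli:/> director ping -n director-1-1-A -i <remote-director-B-IP>
VPlexcli:/> director ping -n director-1-1-B -i <remote-director-B-IP>

Compare each reported round-trip time against the 5 ms (Metro) or 50 ms (Geo) requirement.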

26 VPLEX Local IP Layout [Diagram: cluster-1 internal IP layout showing the Management Server ports (A-side management, B-side management, service port, and the customer-assigned public LAN port), the FC switch 1 and 2 IP addresses, and the Director A and Director B IP addresses for Engines 1 through 4] This diagram shows the Ethernet Management Server connections. It also shows the internal IP addresses for cluster-1, which uses a cluster number of 1.

27 Test the IP Connectivity
ping -c 3 -I eth PING ( ) from eth0: 56(84) bytes of data. 64 bytes from : icmp_seq=1 ttl=63 time=0.563 ms 64 bytes from : icmp_seq=2 ttl=63 time=0.397 ms 64 bytes from : icmp_seq=3 ttl=63 time=0.550 ms ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2003ms rtt min/avg/max/mdev = 0.397/0.503/0.563/0.077 ms If pings do not work, try a different protocol such as ssh C1: ssh ssh C2: ssh ssh Ping all internal IP address combinations (the management addresses ending in .252 and the failover addresses ending in .253) on the remote cluster in order to completely validate whether the VPN is working. Make sure that you run all of the commands below, as they will reveal either the symmetric or asymmetric nature of the connectivity loss. Reference the back of the Installation and Setup and/or Configuration Guide for a complete list of the VPLEX IP addresses. Management-server-2:~> ping -c <number of times to ping> -I eth<port #> <public IP address of the remote MS> Reference emc for how to get the remote management server's IP Address.
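A hedged sketch of the full connectivity check from the cluster-1 Management Server; all addresses and the prompt name are placeholders because the actual values were elided above, and ssh is used only as an alternate reachability test when ping is blocked:

Management-server-1:~> ping -c 3 -I eth0 <public IP of remote Management Server>
Management-server-1:~> ping -c 3 <remote internal .252 address>
Management-server-1:~> ping -c 3 <remote internal .253 address>
Management-server-1:~> ssh <remote internal .252 address>
Management-server-1:~> ssh <remote internal .253 address>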

28 Latency Between Directors: director tracepath
Provides route information from a director to a given target machine Displays round-trip latency and MTU of each hop to the destination VPlexcli:/> director tracepath -n director-1-1-B -i Destination reachable. 3 hops: Source: endpoint , latency 0.044ms, mtu 1500, reachable Hop 1: endpoint , latency 0.151ms, mtu 1500, reachable Hop 2: endpoint , latency 0.14ms, mtu 1428, reachable Hop 3: endpoint , latency 1.13ms, mtu 1428, reachable The director tracepath command is a new command that is part of VPLEX GeoSynchrony v5.0 that allows an administrator to test the round trip latency and hop count from a source VPLEX director to another remote director or VPLEX Witness. This command can be used to aid in troubleshooting network latency and routing issues. Note that the first hop is from the local director (.36) to the management port for side B (.33). The next hop shows the same IP address, but in this case we also note the MTU has changed. This shows us the clustered management port on the second site is management port (.33) and it crosses a LAN. The final hop is from the remote management port to the remote director (.67). Module 7 - VPLEX Troubleshooting Module 7 - VPLEX Troubleshooting

29 Identify Inter-Cluster Link Failures: cluster summary
VPlexcli:/> cluster summary Clusters: Name Cluster ID Connected Expelled Operational Status Health State cluster- true false ok ok cluster- true false transitioning degraded Islands: Island ID Clusters 1 cluster-1 2 cluster-2 VPlexcli:/> cluster summary Clusters: Name Cluster ID Connected Expelled Operational Status Health State cluster-1 1 true false ok ok cluster-2 2 true false ok ok Islands: Island ID Clusters 1 cluster-1, cluster-2 An inter-cluster link failure can be identified by running the cluster summary command. If two islands are listed, it's a sure sign that the clusters can't communicate with each other and that there is an inter-cluster link failure. In a normal VPLEX Metro or Geo there should only be one island and both clusters should be listed in the island.

30 Identify Inter-Cluster Link Failures: WAN COM Links
VPlexcli:/> ll /**/ports /engines/engine-1-1/directors/director-1-1-A/hardware/ports: Name Address Role Port Status A0-FC00 0x front-end up A0-FC01 0x front-end up A0-FC02 0x front-end no-link A0-FC03 0x front-end no-link A1-FC00 0x back-end up A1-FC01 0x back-end up A1-FC02 0x back-end no-link A1-FC03 0x back-end no-link A2-XG wan-com up A2-XG wan-com no-link A3-FC00 0x local-com up A3-FC01 0x local-com up A3-FC02 0x up A3-FC03 0x up Another possible indication of an inter-cluster link failure could be that the WAN-COM links are showing “no-link” or “down”. The WAN-COM link port status should be displayed as “up”. The port status of VPLEX ports can be checked by using the ll command within the ports directory. Module 7 - VPLEX Troubleshooting Module 7 - VPLEX Troubleshooting

31 Identify Inter-Cluster Link Failures: WAN COM Links
VPlexcli:/clusters/cluster-1/consistency-groups/async_consist> ll Attributes: Name Value active-clusters [cluster-1] cache-mode asynchronous detach-rule winner cluster-1 after 5s operational-status [(cluster-1,{ summary:: suspended, details:: [cluster-departure] }), (cluster-2,{ summary:: suspended, details:: [cluster-departure] })] passive-clusters [cluster-2] storage-at-clusters [cluster-1, cluster-2] virtual-volumes [dd_vol] visibility [cluster-1, cluster-2] Another possible indication of an inter-cluster link failure is when the consistency groups display "cluster-departure". When a consistency group is functioning properly, the summary field should display ok and there shouldn't be any details. The top box displays the state as suspended because the detach rules have not fired. Once the detach rules fire, cluster-1 will be allowed to continue I/O and cluster-2 will suspend.

32 Identify VPN Issues: vpn status
VPlexcli:/> vpn status Verifying the VPN status between the management servers... IPSEC is UP Remote Management Server at IP Address is reachable Remote Internal Gateway addresses are reachable Verifying the VPN status between the management server and the cluster witness server... Cluster Witness Server at IP Address is not reachable VPlexcli:/> vpn status Verifying the VPN status between the management servers... IPSEC is UP Remote Management Server at IP Address is reachable Remote Internal Gateway addresses are reachable Verifying the VPN status between the management server and the cluster witness server... Cluster Witness Server at IP Address is reachable The vpn status command can be used to identify any VPN issues between the two management servers and the VPLEX Witness. Here the vpn status command shows that there is a connection issue with the VPLEX Witness. This could mean that there is a problem with the VPN connection or that the VPLEX Witness is down. More analysis needs to be done to identify the reason that it is not reachable. In a healthy VPLEX environment, IPsec should be up for both the remote cluster and the VPLEX Witness, and their IP addresses should be reachable.

33 Check the VPN Uptime Management-server-2:~> sudo /usr/sbin/ipsec statusall 000 interface lo/lo ::1: interface lo/lo : interface eth0/eth : interface eth3/eth : interface eth2/eth : interface eth1/eth : %myid = (none) 000 debug none 000 Performance: uptime: 2 days, since Aug 02 16:30: worker threads: 9 idle of 16, job queue load: 0, scheduled events: 6 loaded plugins: aes des sha1 sha2 md5 fips-prf random x509 pubkey xcbc hmac gmp kernel-netlink stroke updown Listening IP addresses: Connections: net-net: [C=US, ST=Massachusetts, O=EMC, OU=EMC, CN=VPlex VPN: LEVEL3MEDIUM02, ST=Massachusetts, O=EMC, OU=EMC, CN=VPlex VPN: LEVEL3MEDIUM01, net-net: CAs: "C=US, ST=Massachusetts, L=Hopkinton, O=EMC, OU=EMC, CN=LEVEL3MEDIUM01, net-net: public key authentication net-net: / /27 === / /27 Security Associations: net-net[22]: ESTABLISHED 26 minutes ago, [C=US, ST=Massachusetts, O=EMC, OU=EMC, CN=VPlex VPN: LEVEL3MEDIUM02, ST=Massachusetts, O=EMC, OU=EMC, CN=VPlex VPN: LEVEL3MEDIUM01, net-net[22]: IKE SPIs: 06db3d4cd388d131_i* 74f967a554f24965_r, public key reauthentication in 2 hours net-net[22]: IKE proposal: 3DES/AUTH_HMAC_SHA2_256_128/PRF_HMAC_SHA2_256/MODP_2048_BIT net-net{22}: INSTALLED, TUNNEL, ESP SPIs: cd94995f_i cd1b75be_o net-net{22}: AES_CBC-256/AUTH_HMAC_SHA2_256_128, rekeying in 17 minutes, last use: 0s_i 0s_o Check on the status and uptime of the VPN from the service account. Module 7 - VPLEX Troubleshooting Module 7 - VPLEX Troubleshooting

34 Cache Vaulting Process Flow
[Process flow: Vault Inactive -> cluster-wide power failure or manual vault -> power ride-out -> quiesce I/O and freeze dirty cache pages -> write vault -> stop director firmware; if power is restored during ride-out, the cluster returns to normal operation] When a cluster detects power loss in both of its power zones, VPLEX triggers the cluster to enter the 30 second ride-out phase. This delays the (irreversible) decision to vault, allowing for a timely return of AC input to avoid vaulting altogether. It also guarantees that every engine will be powered on long enough (5 minutes) for all of its dirty data to be dumped to vault storage. During the ride-out phase, all mirror rebuilds and migrations pause, and new configuration changes on the local cluster are disallowed, to prepare for a possible vault. If the power is restored prior to the 30 second ride-out, all mirror rebuilds and migrations resume, and configuration changes are once again allowed. If the power is not restored within 30 seconds, the cluster begins vaulting. Power ride-out is not necessary when a manual vault has been requested. However, similar to the power ride-out phase, manual vaulting stops any mirror rebuilds and migrations and disallows any configuration changes on the local cluster. Once all I/O is discontinued, the inter-cluster links are disabled to isolate the vaulting cluster from the remote cluster. These steps are required to freeze the director's dirty cache in preparation for vaulting. Once the dirty cache has been frozen, each director in the vaulting cluster isolates itself from the other directors and starts writing. When finished writing to its vault, the director stops its firmware.

35 Vault Status VPlexcli:/> vault status -c cluster-2 --verbose ================================================================================ Cluster level vault status summary Cluster:/clusters/cluster-2 Cluster is vaulting Total number of bytes remaining to vault in the cluster: GB Estimated time remaining for cluster's vault completion: 10 seconds Director level vault status summary /engines/engine-2-1/directors/director-2-1-B: state: Vaulting - Writing vault data to vault disk Total number of bytes to vault: MB Total number of bytes vaulted: MB Total number of bytes remaining to vault: MB Percent vaulted: 1% Average vault rate: MB/second Estimated time remaining to vault complete: 10 seconds /engines/engine-2-1/directors/director-2-1-A: Total number of bytes to vault: MB Total number of bytes remaining to vault: MB Average vault rate: MB/second Estimated time remaining to vault complete: 10 second Vaulting dumps all dirty data to persistent local storage. Vaulted data is recovered (“unvaulted”) when power is restored. When run after a vault has begun and the vault state is 'Vault Writing‘ or 'Vault Written', the following information is displayed: Total number of bytes to be vaulted in the cluster Estimated time to completion for the vault. When run after the directors have booted and unvaulting has begun and the states are 'Unvaulting' or 'Unvault Complete', the following information is displayed: Total number of bytes to be unvaulted in the cluster Estimated time to completion for the unvault. Percent of bytes remaining to be unvaulted Number of bytes remaining to be unvaulted. If the --verbose argument is used, the following additional information is displayed: Average vault or unvault rate Module 7 - VPLEX Troubleshooting Module 7 - VPLEX Troubleshooting

36 Forced Vault Recovery VPlexcli:/> vault overrideUnvaultQuorum --evaluate-override-before-execution -c cluster-1 Cluster's unvault recovery quorum status: Only 3 out of 4 configured directors on this cluster are running, and none has reported a valid vault. All configured directors must be present to verify if any director has successfully vaulted dirty data the last time the cluster was servicing I/O. Missing directors in the cluster: director-1-1-A VPlexcli:/> vault overrideUnvaultQuorum -c cluster-1 Warning: Execution of this command can result in possible data loss based on the current vault status of the cluster. Only 3 out of 4 configured directors on this cluster are running, and none has reported a valid vault. 3 out of 4 directors that were servicing I/O the last time the cluster had vaulted are present, which is sufficient to proceed with vault recovery. Do you wish to override unvault quorum? (Yes/No) Yes Execution of the override unvault quorum has been issued! A director's vaulted data (its "vault") is valid if the director can determine that the data is not corrupted or stale. A cluster's unvault quorum is a state in which some directors have determined that their vault is valid while other directors are either not operational or have not determined that their vault is valid. In this state, those directors that have successfully determined that their vault is valid wait for the remaining directors before proceeding with recovery. Use this command to tell the cluster not to wait for all the required director(s) before proceeding with vault recovery. Unvault recovery quorum is the set of directors that had vaulted their dirty cache data during the last successful cluster-wide vault. These directors must boot and rejoin the cluster in order to recover their vaulted data, which is needed to preserve cache coherency and avoid data loss. The cluster will wait indefinitely for these directors to join the cluster. Use this command with the --evaluate-override-before-execution argument to evaluate the cluster's vault status and make a decision whether to accept a possible data loss and continue to bring the cluster up.

37 VPLEX Witness Failure Domains
[Diagram: the VPLEX Witness in failure domain #3, separate from the two cluster failure domains (#1 and #2), connected over the IP management network and inter-cluster networks A and B] Cluster Witness should be in a "third failure domain". Cluster Witness Server is installed as a VM. Cluster Witness Server provides guidance to both clusters depending on its observations.

38 VPLEX Witness Troubleshooting: VPLEX Witness Connectivity
The output for configuration automation has been captured in /var/log/VPlex/cli/capture/VPlexconfiguration-session.txt configuration Evaluation of <<configuration cw-vpn-configure cw-vpn-configure: -i >> failed. cause: Command execution failed. cause: Unable to get the certificate subject info file from - Scp Connection Failure 3-Way VPN will fail because of connectivity ping PING ( ) 56(84) bytes of data. 64 bytes from : icmp_seq=1 ttl=64 time=4.92 ms 64 bytes from : icmp_seq=2 ttl=64 time=0.403 ms 64 bytes from : icmp_seq=3 ttl=64 time=0.415 ms When configuring the VPLEX Witness 3-way VPN, an error displaying "unable to get certificate subject info" could be the result of the cluster not having connectivity to the VPLEX Witness. To verify that there is connectivity, ping the VPLEX Witness from the Management Server. The round-trip latency between the VPLEX Witness and the Management Server should be less than one second. If the VPLEX Witness is not pingable, ensure that the VPLEX Witness VM is powered on and also ensure that IP connectivity has been established. If there is a routing issue, contact the IP network administrator to fix the "default route" for the VPLEX Witness Server VM and/or Management Server. Also ensure that the Management Server can ping the ESX host that the VM is residing on. Management Server should be able to ping the VPLEX Witness Round trip latency should be < 1 Second

39 Module 7: VPLEX Troubleshooting
Lesson 3: Summary During this lesson the following topics were covered: How to list the connectivity components in VPLEX How to define the logs and commands used to troubleshoot VPLEX connectivity How to perform basic troubleshooting operations to determine root cause and correct the problem This lesson covered how to troubleshoot VPLEX connectivity issues.

40 Module 7: VPLEX Troubleshooting
Lesson 4: Recovering Distributed Devices List the methods used to recover from WAN failure Define the choices of resume and rollback Show how to recover from a WAN failure This lesson covers recovering distributed devices.

41 Methods to Resume after WAN Failure
consistency-group resolve-conflicting-detach consistency-group resume-after-data-loss-failure consistency-group resume-after-rollback consistency-group resume-at-loser There are 4 methods of resolving a WAN failure. We will review the primary cases for recovery after a WAN failure or other disruption to IO to a distributed device.

42 Link Failure Effects on Consistency Groups
Detach Rule C1 is Active (I/O) C2 is Passive (No I/O) C1 is Passive (No I/O) C2 is Active (I/O) Cluster-1 Wins I/O is allowed on C1 Requires resume-after-rollback at C1 I/O suspends at C2 Requires resume-after-rollback at C2 No data loss / No data rollback Requires a Data Rollback Cluster-2 Wins I/O suspends at C1 I/O is allowed at C2 No data loss / no data rollback Active Cluster Wins I/O is allowed at C1 No Automatic Winner This table displays the outcomes of an inter-cluster link failure for the various detach rules and I/O profiles. It is important to note that if the losing cluster is active during a failure, there will be a data rollback situation. Writes that were in the open, closed, and exchanging delta phases from the losing cluster will be lost. The best outcome is to have the winning cluster be the only active cluster prior to the inter-cluster link failure.

43 Consistency Group States
Inter-Cluster Link is Down During a Failure Group has detached (I/O is going to the winning cluster). Group does not have a detach rule and I/O is suspended at both clusters. Group was active at the losing cluster and is suspended requiring the consistency-group resume-after-rollback command at the winning cluster. Inter-Cluster Link is Up After a Failure Group that was detached during failure continues to allow I/O at the winning cluster. Group that was previously suspended is now running and healthy at both clusters. Group has a conflicting detach because the administrator allowed I/O to go to both clusters while the inter-cluster link was down. Shown here are the possible consistency group states that the consistency groups can be in during an inter-cluster link outage and also after the inter-cluster link outage heals. The next few slides will describe various failure scenarios and the steps that the administrator can perform to recover from the failure. Module 7 - VPLEX Troubleshooting Module 7 - VPLEX Troubleshooting

44 Consistency-group resolve-conflicting-detach
[Diagram: an application cluster writing to a distributed device; the winner has access to storage during the failure, the loser does not, and data is restored from the winner to the loser] In this case, the I/O at the winning cluster continues while the loser ceases I/O. When the resume command is issued, the dirty cache on the loser side is discarded and the data is overwritten with the data from the winning cluster. CAUTION - This command results in data loss at the losing cluster.

45 Resolve-conflicting-detach Scenario
VPlexcli:/clusters/cluster-1/consistency-groups/cg1> ls Attributes: Name Value active-clusters [cluster-1,cluster-2] cache-mode asynchronous detach-rule no-automatic-winner operational-status [(cluster-1,{summary::ok, details::[requires-resolve-conflicting-detach]}), (cluster-2,{summary::ok, details::[requires-resolve-conflicting-detach]})] passive-clusters [] recoverpoint-enabled false storage-at-clusters [cluster-1,cluster-2] virtual-volumes [dd1_vol,dd2_vol] Visibility [cluster-1,cluster-2] Contexts: advanced recoverpoint During an inter-cluster link failure, an administrator may permit I/O to continue at both clusters. When I/O continues at both clusters: The data images at the clusters diverge. Legs of distributed volumes are logically separate. CAUTION - This command results in data loss at the losing cluster. Module 7 - VPLEX Troubleshooting Module 7 - VPLEX Troubleshooting

46 Resolve-conflicting-detach Resolution
VPlexcli:/> VPlexcli:/clusters/cluster-1/consistency-groups/cg1> resolve-conflicting-detach -c cluster-1 This will cause I/O to suspend at clusters in conflict with cluster cluster-1, allowing you to stop applications at those clusters. Continue?(Yes/No) Yes VPlexcli:/clusters/cluster-1/consistency-groups/cg1> ls attributes: Name Value active-clusters [cluster-1,cluster-2] cache-mode asynchronous detach-rule no-automatic-winner operational-status [(cluster-1,{summary::ok, details::[]}), (cluster-2,{summary::suspended, details::[requires-resume-at- loser]})] passive-clusters [] recoverpoint-enabled false storage-at-clusters [cluster-1,cluster-2] virtual-volumes [dd1_vol,dd2_vol] visibility [cluster-1,cluster-2] Contexts: advanced recoverpoint When the inter-cluster link is restored, the clusters learn that I/O has proceeded independently. I/O continues at both clusters until the administrator picks a "winning" cluster whose data image will be used as the source to resynchronize the data images. Use this command to pick the winning cluster. For the distributed volumes in the consistency group: I/O at the "losing" cluster is suspended (there is an impending data change). The administrator stops applications running at the losing cluster. Any dirty cache data at the losing cluster is discarded. The legs of distributed volumes rebuild, using the legs at the winning cluster as the rebuild source. When the applications at the losing cluster are shut down, use the consistency-group resume-at-loser command to allow the system to service I/O at that cluster again. CAUTION - This command results in data loss at the losing cluster.
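Following the output above, once applications at the losing cluster have been stopped, I/O there can be resumed with the resume-at-loser form shown later in this lesson. A brief sketch (cluster-2 is the losing cluster in this example):

VPlexcli:/clusters/cluster-1/consistency-groups/cg1> resume-at-loser -c cluster-2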

47 Consistency-group resume-after-data-loss-failure
[Diagram: an application cluster writing to a distributed device; the winner has access to storage during the failure, the loser may have access, and data is restored from the winner to the loser] Resumes I/O on an asynchronous consistency group when there are data loss failures. When the resume command is issued, the system selects a winning cluster which will be the target data image. The loser side will have its data overwritten with the data from the winning cluster. Because dirty cache on the losing side may be discarded, the total data loss may be larger than expected.

48 Resume-after-data-loss-failure Resume=True Scenario
VPlexcli:/clusters/cluster-1/consistency-groups/CG1> ls Attributes: Name Value active-clusters [cluster-1,cluster-2] cache-mode synchronous detach-rule winner cluster-2 after 5s operational-status [suspended, requires-resume-after-data-loss-failure] passive-clusters [] recoverpoint-enabled false storage-at-clusters [cluster-1,cluster-2] virtual-volumes DR1_RAM_c2win_lr1_36_vol,DR1_RAM_c2win_lr1_16_vol, DR1_RAM_c2win_lr0_6_vol,DR1_RAM_c2win_lrC_46_vol visibility [cluster-1,cluster-2] Contexts: advanced recoverpoint In the event of multiple near-simultaneous director failures, or a director failure followed very quickly by an inter-cluster link failure, an asynchronous consistency group may experience data loss. I/O automatically suspends on the volumes in the consistency group at all participating clusters. Module 7 - VPLEX Troubleshooting Module 7 - VPLEX Troubleshooting

49 Resume-after-data-loss-failure Resume=True Resolution
VPlexcli:/clusters/cluster-1/consistency-groups/CG1> resume-after-data-loss-failure -f -c cluster-1 VPlexcli:/clusters/cluster-1/consistency-groups/CG1> ls Attributes: Name Value active-clusters [cluster-2] cache-mode synchronous detach-rule winner cluster-2 after 5s operational-status [ok] passive-clusters [cluster-1] recoverpoint-enabled false storage-at-clusters [cluster-1,cluster-2] virtual-volumes DR1_RAM_c2win_lr1_36_vol,DR1_RAM_c2win_lr1_16_vol, DR1_RAM_c2win_lr0_6_vol,DR1_RAM_c2win_lrC_46_vol visibility [cluster-1,cluster-2] Contexts: advanced recoverpoint Use this command to resume I/O. Specifically, this command: Selects a winning cluster whose current data image will be used as the base from which to continue I/O. On the losing cluster, synchronizes the data image with the data image on the winning cluster. Resumes I/O at both clusters. This command may make the data loss larger, because dirty data at the losing cluster may be discarded. All the clusters participating in the consistency group must be present in order to use this command. If there has been no data-loss failure in the group, this command prints an error message and does nothing. Module 7 - VPLEX Troubleshooting Module 7 - VPLEX Troubleshooting

50 Resume-after-data-loss-failure Resume=False Scenario
VPlexcli:/clusters/cluster-1/consistency-groups/CG1>ls Attributes: Name Value active-clusters [cluster-1,cluster-2] cache-mode synchronous detach-rule winner cluster-2 after 5s operational-status [suspended, requires-resume-after-data-loss-failure] passive-clusters [] recoverpoint-enabled false storage-at-clusters [cluster-1,cluster-2] virtual-volumes DR1_RAM_c2win_lr1_36_vol,DR1_RAM_c2win_lr1_16_vol, DR1_RAM_c2win_lr0_6_vol,DR1_RAM_c2win_lrC_46_vol visibility [cluster-1,cluster-2] Contexts: advanced recoverpoint In this example, the auto-resume-at-loser property of the consistency group is set to false; that is I/O remains suspended on the losing cluster when connectivity is restored. I/O must be manually resumed. The ls command displays the operational status of a consistency group where cluster-1 is the winner (detach-rule is winner) after multiple failures at the same time have caused a data loss. The cluster-1 side of the consistency group is suspended. Module 7 - VPLEX Troubleshooting Module 7 - VPLEX Troubleshooting

51 Resume-after-data-loss-failure Resume=False Resolution
VPlexcli:/clusters/cluster-1/consistency-groups/CG1> resume-after-data-loss-failure -f -c cluster-1 VPlexcli:/clusters/cluster-1/consistency-groups/CG1> ls Attributes: Name Value active-clusters [cluster-2] cache-mode synchronous detach-rule winner cluster-2 after 5s operational-status [ok] passive-clusters [cluster-1] recoverpoint-enabled false storage-at-clusters [cluster-1,cluster-2] virtual-volumes DR1_RAM_c2win_lr1_36_vol,DR1_RAM_c2win_lr1_16_vol, DR1_RAM_c2win_lr0_6_vol,DR1_RAM_c2win_lrC_46_vol visibility [cluster-1,cluster-2] Contexts: advanced recoverpoint In this example, the auto-resume-at-loser property of the consistency group is set to false; that is, I/O remains suspended on the losing cluster when connectivity is restored. I/O must be manually resumed. The resume-after-data-loss-failure command selects cluster-2 as the source image from which to re-synchronize data. After a short wait, the ls command displays that the cluster-1 side of the consistency group remains suspended.

52 Consistency-group resume-after-rollback
[Diagram: an application cluster writing to a distributed device; the winner has access to storage during the failure and rolls back to consistent data for the restore, while I/O is stopped at the loser] Resumes I/O to the volumes on the winning cluster in a consistency group after: The losing cluster(s) have been detached, and Data has been rolled back to the last point at which all clusters had a consistent view. When the resume command is issued, the winning cluster rolls back its data image to the last point at which the clusters had the same data images, and then allows I/O to resume at that cluster. At the losing cluster, I/O remains suspended unless the 'auto-resume' flag is set to 'true'. WARNING - In a Geo configuration, on a cluster that successfully vaulted and unvaulted, the user should contact EMC Engineering for assistance before rolling back the data prior to re-establishing communication with the non-vaulting cluster.

53 Resume-after-rollback Scenario
VPlexcli:/clusters/cluster-1/consistency-groups/cg1> choose-winner -c cluster-1 WARNING: This can cause data divergence and lead to data loss. Ensure the other cluster is not serving I/O for this consistency group before continuing. Continue?(Yes/No) Yes VPlexcli:/clusters/cluster-1/consistency-groups/cg1> ls Attributes: Name Value active-clusters [cluster-1,cluster-2] cache-mode asynchronous detach-rule no-automatic-winner operational-status [cluster-departure, rebuilding-across-clusters, restore-link-or-choose-winner]}), (cluster-2,{summary:: suspended, details:: [cluster-departure, rebuilding-across-clusters, restore-link-or-choose-winner] passive-clusters [] recoverpoint-enabled false storage-at-clusters [cluster-1,cluster-2] virtual-volumes [dd1_vol,dd2_vol] visibility [cluster-1,cluster-2] Contexts: advanced recoverpoint This command is part of a two-step recovery procedure to allow I/O to continue in spite of an inter-cluster link failure. Use the "consistency-group choose-winner" command to select the winning cluster. Use this command to tell the winning cluster to roll back its data image to the last point where the clusters were known to agree, and then proceed with I/O. The first step in the recovery procedure can be automated by setting a detach-rule-set. The second step is required only if the losing cluster has been "active", that is, writing to volumes in the consistency group since the last time the data images were identical at the clusters. If the losing cluster is active, the distributed cache at the losing cluster contains dirty data, and without that data, the winning cluster's data image is inconsistent. WARNING - In a Geo configuration, on a cluster that successfully vaulted and unvaulted, the user should contact EMC Engineering for assistance before rolling back the data prior to re-establishing communication with the non-vaulting cluster.

54 Resume-after-rollback Resolution
VPlexcli:/clusters/cluster-1/consistency-groups/cg1> resume-after-rollback This will change the view of data at cluster cluster-1, so you should ensure applications are stopped at that cluster. Continue? (Yes/No) Yes VPlexcli:/clusters/cluster-1/consistency-groups/cg1> ls Attributes: Name Value active-clusters [cluster-1,cluster-2] cache-mode asynchronous detach-rule no-automatic-winner operational-status [(cluster-1,{summary::ok, details::[]}), (cluster-2,{summary::suspended, details::[cluster- departure]})] passive-clusters [] recoverpoint-enabled false storage-at-clusters [cluster-1,cluster-2] virtual-volumes [dd1_vol,dd2_vol] visibility [cluster-1,cluster-2] Contexts: advanced recoverpoint Resuming I/O at the winner requires rolling back the winner's data image to the last point where the clusters agreed. Applications may experience difficulties if the data changes, so the roll-back and resumption of I/O is not automatic. The delay gives the administrator the chance to halt applications. The administrator then uses this command to start the rollback in preparation for resuming I/O. The winning cluster rolls back its data image to the last point at which the clusters had the same data images, and then allows I/O to resume at that cluster. At the losing cluster, I/O remains suspended. When the inter-cluster link is restored, I/O remains suspended at the losing cluster, unless the 'auto-resume' flag is set to 'true'. WARNING - In a Geo configuration, on a cluster that successfully vaulted and unvaulted, the user should contact EMC Engineering for assistance before rolling back the data prior to re-establishing communication with the non-vaulting cluster.

55 Consistency-group resume-at-loser
[Diagram: an application cluster writing to a distributed device; the winner has access to storage during the failure, I/O is stopped at the loser, and the loser waits for the user to resume before data is restored from the winner] If I/O is suspended due to a data change, this option resumes I/O at the specified cluster and consistency group. When the resume command is issued, the winning cluster continues I/O. The loser stops all I/O. When the link is restored, I/O remains stopped at the loser site until the user enables I/O. I/O is then restored from the winning cluster.

56 Resume-at-loser Scenario
VPlexcli:/clusters/cluster-1/consistency-groups/cg1> ls Attributes: Name Value active-clusters [cluster-1,cluster-2] cache-mode asynchronous detach-rule no-automatic-winner operational-status [(cluster-1,{summary::ok, details::[]}), (cluster-2,{summary::suspended, details::[requires-resume-at-loser]})] passive-clusters [] recoverpoint-enabled false storage-at-clusters [cluster-1,cluster-2] virtual-volumes [dd1_vol,dd2_vol] visibility [cluster-1,cluster-2] Contexts: advanced recoverpoint During an inter-cluster link failure, an administrator may permit I/O to resume at one of the two clusters: the “winning” cluster. I/O remains suspended on the “losing” cluster. When the inter-cluster link heals, the winning and losing clusters re-connect, and the losing cluster discovers that the winning cluster has resumed I/O without it. Unless explicitly configured otherwise (using the auto-resume-at-loser property), I/O remains suspended on the losing cluster. This prevents applications at the losing cluster from experiencing a spontaneous data change. The delay allows the administrator to shut down applications. Module 7 - VPLEX Troubleshooting Module 7 - VPLEX Troubleshooting

57 Resume-at-loser Resolution
VPlexcli:/clusters/cluster-1/consistency-groups/cg1>resume-at-loser –c cluster-2 This may change the view of data presented to applications at cluster cluster-2.You should first stop applications at that cluster. Continue?(Yes/No)Yes VPlexcli:/clusters/cluster-1/consistency-groups/cg1>ls Attributes: Name Value active-clusters [cluster-1,cluster-2] cache-mode asynchronous detach-rule no-automatic-winner operational-status [(cluster-1,{summary::ok,details::[]}), (cluster-2,{summary::ok,details::[]})] passive-clusters [] recoverpoint-enabled false storage-at-clusters [cluster-1,cluster-2] virtual-volumes [dd1_vol,dd2_vol] visibility [cluster-1,cluster-2] Contexts: advanced recoverpoint After stopping the applications, the administrator can use this command to: Resynchronize the data image on the losing cluster with the data image on the winning cluster, Resume servicing I/O operations. The administrator may then safely restart the applications at the losing cluster. Without the '--force' option, this command asks for confirmation to proceed, since its accidental use while applications are still running at the losing cluster could cause applications to misbehave. Module 7 - VPLEX Troubleshooting Module 7 - VPLEX Troubleshooting

58 Module 7: VPLEX Troubleshooting
Lesson 4: Summary During this lesson the following topics were covered: How to list the methods used to recover from WAN failure How to define the choices of resume and rollback How to recover from a WAN failure This lesson covered how to recover distributed devices.

59 VPLEX Troubleshooting
Perform a system data collection Determine root cause analysis This lab covers how to troubleshoot a VPLEX environment.

60 Module 7: Check Your Knowledge - Questions
In which file can a field support technician find readable logs? diagnostics diagnostics-extended configdump.xls Perpetual-monitors If the cluster fails and cannot access the meta volume, what is the first step to recover the system? Activate a backup copy Copy the running metadata to disk Promote a backup copy Restore the metadata from the backup What method is used to replace the array that holds the Metadata volume? Mirror the meta device Migrate the device Configure a Mobility job for the volume Promote a backup device

61 Module 7: Check Your Knowledge - Answers
In which file can a field support technician find readable logs? diagnostics diagnostics-extended configdump.xls Perpetual-monitors If the cluster fails and cannot access the meta volume, what is the first step to recover the system? Activate a backup copy Copy the running metadata to disk Promote a backup copy Restore the metadata from the backup What method is used to replace the array that holds the Metadata volume? Mirror the meta device Migrate the device Configure a Mobility job for the volume Promote a backup device

62 Module 7: Check Your Knowledge - Questions
What does the director ping command test? Latency between the sites Latency between the directors WAN latency VPN performance What command is used to restore an asynchronous consistency group that fails when both sites were active and have dirty cache? resolve-conflicting-detach resume-after-data-loss-failure resume-after-rollback resume-at-loser

63 Module 7: Check Your Knowledge - Answers
What does the director ping command test? Latency between the sites Latency between the directors WAN latency VPN performance What command is used to restore an asynchronous consistency group that fails when both sites were active and have dirty cache? resolve-conflicting-detach resume-after-data-loss-failure resume-after-rollback resume-at-loser

64 Module 7: Summary Show how to perform data collection processes
Show how to perform basic troubleshooting on various VPLEX elements Show how to collect and analyze data to resolve connectivity errors This module covered how to troubleshoot a VPLEX environment.

65 Installation of a VPLEX Metro configuration
Common VPLEX terms, configuration options, hardware and software architecture Installation of a VPLEX Metro configuration Performing a Non-disruptive upgrade How data flows within a VPLEX system at a high level Managing VPLEX using the GUI and CLI Provisioning virtual volumes to hosts Encapsulating existing SAN storage volumes into VPLEX Performing VPLEX mobility Designing VPLEX into a new or existing data center Performing routine VPLEX monitoring tasks This course covered the installation and configuration of VPLEX. Usage of the product, monitoring, and troubleshooting were also covered.

