CLARiiON CX Series Boot Process March 2005

1 VMware vCenter Site Recovery Manager 5.x with EMC VNX Arrays & MirrorView
By Dave O’Sullivan

2 Intended Audience
VNX Block / CLARiiON Block. This training gives an overview of SRM and explains how the relevant software plug-ins and hardware interact with each other. It covers: Pre-requisites, Design, Test Failover / Failover / Recovery [DEMO], Required Logs, Troubleshooting.

3 Assumptions
You are familiar (not expert) with: VMs, vCenter, MirrorView/A and MirrorView/S, VNX / CLARiiON arrays, SRA(s). Before the customer does any work with SRM, confirm that MirrorView is working (check zoning) and that all the appropriate software is installed, including enablers.

4 What is SRM?
SRM ensures the simplest and most reliable disaster protection for all virtualized applications. Site Recovery plans can be tested non-disruptively as frequently as required to ensure that they meet business objectives. At the time of a site failover or migration, Site Recovery Manager automates both the failover and failback processes, ensuring fast and highly predictable recovery point objectives (RPOs) and recovery time objectives (RTOs).

5 Pre-requisites
SRM relies heavily on DNS, so DNS is assumed to be fully set up, with every host resolvable in both directions (forward and reverse). IP connectivity is required between all SPs, vCenter servers and ESX hosts on both sites. SRM also relies on databases; this setup has 4 in total: 1 DB for vCenter and 1 DB for SRM, on each site. The following document covers the DB setup in full detail: "How to Install and Configure SQL Express 2005 for use with Site Recovery Manager V4", Rob Nourse, Sr. Consultant, VMware Consulting Services.
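As a quick pre-check of the "resolvable in both directions" requirement, the sketch below resolves each host name forward, then reverse-resolves the returned IP and compares the names. It is an illustrative Python sketch, not an EMC or VMware tool. The host names in the usage note are hypothetical. The resolver functions are injectable so the logic can be exercised without a live DNS server.

```python
import socket
from typing import Callable, Dict, Iterable, List, Tuple


def check_dns(
    hosts: Iterable[str],
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], Tuple[str, List[str], List[str]]] = socket.gethostbyaddr,
) -> Dict[str, str]:
    """Return 'ok' or a short failure description for each host name."""
    results: Dict[str, str] = {}
    for host in hosts:
        try:
            ip = forward(host)  # forward lookup: name -> IP
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            results[host] = f"forward lookup failed: {exc}"
            continue
        try:
            name, aliases, _ = reverse(ip)  # reverse lookup: IP -> name
        except OSError as exc:  # socket.herror is a subclass of OSError
            results[host] = f"reverse lookup of {ip} failed: {exc}"
            continue
        # Accept a match on the FQDN or on the short host name, ignoring case.
        names = [n.lower() for n in (name, *aliases)]
        short = host.lower().split(".")[0]
        if host.lower() in names or short in (n.split(".")[0] for n in names):
            results[host] = "ok"
        else:
            results[host] = f"reverse lookup of {ip} returned {name}"
    return results
```

Run it on the SRM server of each site. Pass the vCenter, SRM, ESX and SP management names for both sites, for example `check_dns(["vc-prod.example.local", "srm-dr.example.local"])` (hypothetical names). Any value other than "ok" should be fixed before SRM is configured.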

6 Design: …

7 Design considerations: IP / DNS

8 Design considerations: Software / Plugins

9 Design considerations: MirrorView
For SRM failover to work, the "protected" VMs must be located on a LUN that is replicated from the production site to the DR site. This is handled by MirrorView/A or MirrorView/S (this setup uses MirrorView/A). Below is the LUN info for my setup:

10 Design considerations: MirrorView Zoning
For MirrorView to work, the appropriate ports must be zoned together: in this setup, the FC ports used for MirrorView are zoned to the opposite array.

11 Design considerations: MirrorView

12 Design considerations: MirrorView
The LUN is created first on the production (Athena) side. Then use the MirrorView option "create secondary mirror" and follow through the wizard. I used the Mirror Wizard to complete this task.

13 Design considerations: MirrorView
When it's working, it should look like this:

14 Design considerations: Reserved LUN Pool
You must add LUNs with adequate capacity to the Reserved LUN Pool before proceeding. The pool is used when the SRA creates a SnapView snapshot for the SRM test failover (only!).

15 Design considerations: DEMO

16 Design considerations:
That's pretty much it on the VNX side. Once MirrorView is up and running, you should be good to go with the SRM Windows / VMware side of the setup. Next: what is the SRA?

17 SRA [Storage Replication Adapters]
What is the SRA? The SRA is a Windows .exe installed on the same Windows box as SRM, as part of the SRM setup. SRM "talks" to the SRA, and the SRA sends Navisphere CLI (naviseccli) commands to the array. This is why naviseccli must be installed on the same box as SRM (check the PATH!). Each array vendor has its own SRAs. The EMC SRAs are EMC code, so we support them! SRAs for SRM 5.x: for the full list of storage replication adapters supported by SRM 5.x, see
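The SRA shells out to naviseccli, so a quick way to confirm it is reachable is to look it up on the PATH. The sketch below does this with Python's standard `shutil.which`. The PATH seen by the SRM service account may differ from your interactive session. For that reason the function accepts an explicit `path` string; this is a testing convenience assumed here, not SRA behaviour.

```python
import shutil
from typing import Optional


def find_naviseccli(path: Optional[str] = None) -> Optional[str]:
    """Return the full path to naviseccli, or None if it is not found.

    `path` defaults to the current PATH; pass another PATH string (for
    example the one the SRM service account uses) to check that instead.
    On Windows, shutil.which also tries the PATHEXT extensions (.EXE etc.).
    """
    return shutil.which("naviseccli", path=path)


if __name__ == "__main__":
    location = find_naviseccli()
    if location:
        print(f"naviseccli found at {location}")
    else:
        print("naviseccli not found on PATH; expect \"'naviseccli' is not "
              "recognized\" errors in the SRA logs")
```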

18 SRA [Storage Replication Adapters]
These are the most current supported EMC SRAs:

19 SRA [Storage Replication Adapters]
So on both sites, the following should be installed. Note that there are 2 SRA components: the VNX SRA for vCenter, and the MirrorView Enabler for VNX SRA. As we are only doing block replication, we only need the MirrorView Enabler. NFS replication is also possible using EMC_VNX_Replicator_Enabler_for_VNX_SRA_v5.0.xx.

20 SRA [Storage Replication Adapters] DEMO

21 Test Failover – sequence of events
We will look at what happens on both sites concerning: SRM, VNX, ESX.

22 Test Failover – sequence of events [PROD]

23 Test Failover – sequence of events [PROD]
- Be sure to check the output from the Recovery Plan History Report (VMware Site Recovery Manager 5.0).

24 Test Failover – sequence of events [PROD]
- This is a cosmetic issue which *should* be fixed in later versions of SRM:
Warning: Failed to update embedded paths in virtual machine file '/vmfs/volumes/507432e1-3a92a2a4-027e-b8ac6f866cc6/2008-1/2008-1_1.vmdk'. A general system error occurred: No such device
Failed to update embedded paths in virtual machine file '/vmfs/volumes/507432e1-3a92a2a4-027e-b8ac6f866cc6/2008-2/2008-2_1.vmdk'. A general system error occurred: No such device

25 Test Failover – sequence of events [PROD ESX]
- Some fairly serious-looking errors appear in the vmkernel log on the prod ESX host; these can be ignored:
)WARNING: VMW_SATP_LIB_CX: satp_lib_cx_otherSPIsHung:338:Path "vmhba2:C0:T1:L3" Peer SP is hung.
)WARNING: VMW_SATP_LIB_CX: satp_lib_cx_otherSPIsHung:338:Path "vmhba3:C0:T0:L3" Peer SP is hung.
)ALERT: NMP: vmk_NmpVerifyPathUID:1166:The physical media represented by device naa da02e0012ce6a8f930de211 (path vmhba3:C0:T1:L9) has changed. If this is a data LUN, this is a critical error. Detecte
)ALERT: NMP: vmk_NmpVerifyPathUID:1166:The physical media represented by device naa da02e0012ce6a8f930de211 (path vmhba2:C0:T0:L9) has changed. If this is a data LUN, this is a critical error. Detecte
)NMP: nmp_DeviceUpdatePathStates:547: Activated path "vmhba2:C0:T1:L9" for NMP device "naa da02e0012ce6a8f930de211".
- Watch out for messages like this; customers could open cases based on these errors alone…

26 Test Failover – sequence of events [DR]

27 Test Failover – sequence of events [DR ESX]
LVM: 8445:00:00 Device naa e005e32bf721a12e211:1 detected to be a snapshot:
8452:00:00 queried disk ID: <type 2, len 22, lun 11, devType 0, scsi 0, h(id) >
8459:00:00 on-disk disk ID: <type 2, len 22, lun 1, devType 0, scsi 0, h(id) >
8825:00:00 Device naa e005e32bf721a12e211:1 unsnapped
5510:00:00 Snapshot LV <snap-37ce81f0-503f25d9-c56a845d-4ee3-0026b > successfully resignatured
13188 : One or more LVM devices have been discovered.

28 Test Failover – sequence of events [DR VNX]
A /09/12 14:20:23 SnapCopy a Snapshot Logical Unit device CopyDisk0000 has been created.
B /09/12 14:20:23 SnapCopy a Snapshot Logical Unit device CopyDisk0000 has been created.
A /09/12 14:20: 'Create a SnapShot LU' called by 'admin' ( ) on 'Navi_SnapCopyFeature' with result: Success (Successfully created SnapShot LU.)
A /09/12 14:20: '' called by 'admin' ( ) on 'Navi_SnapCopyFeature' with result: Success (Started SnapView session successfully. Session name - async-25_SRM-TEST-FAILOVER_session)
A /09/12 14:20:27 SnapCopy SnapView persistent session async-25_SRM-TEST-FAILOVER_session has been started on LUN 25.
B /09/12 14:20:27 SnapCopy SnapView persistent session async-25_SRM-TEST-FAILOVER_session has been started on LUN 25.
A /09/12 14:20: 'Activate' called by 'admin' ( ) on 'SnapShot WWN: 60:06:01:60:54:50:2E:00:92:7D:58:76:1C:12:E2:11' with result: Success (Successfully activated snapshot LU: 60:06:01:60:54:50:2E:00:92:7D:58:76:1C:12:E2:11 (async-25_SRM-TEST-FAILOVER_session))
A /09/12 14:20: 'ExecuteClientRequest' called by ' Navi User admin' ( ) on 'CLIFeature' (Result: Success). snapview -storagegroup -addsnapshot -gname SG_dellpr710-g.emcvmw.ctc -hlu 9 -snapshotname async-25_SRM-TEST-FAILOVER -compatibilitymode called by 'admin'
A /09/12 14:20:39 RemoteMirror MirrorView quiesce LU request.
A /09/12 14:20:39 RemoteMirror RM_ADMIN_INFO_WILL_REBIND the object.
A /09/12 14:20:39 RemoteMirror MirrorView rebind request for LUN b9502e00:a4cc7fc81507e211.
A /09/12 14:20:39 SnapCopy SnapView has been bound to device
B /09/12 14:20:39 RemoteMirror Quiesce request from peer SP.
B /09/12 14:20:39 RemoteMirror Rebind request from peer SP for LUN b9502e00:a4cc7fc81507e211.
B /09/12 14:20:39 SnapCopy SnapView has been bound to device Disk0001.
A /09/12 14:20: 'Create a SnapShot LU' called by 'admin' ( ) on 'Navi_SnapCopyFeature' with result: Success (Successfully created SnapShot LU.)
A /09/12 14:20:40 SnapCopy a Snapshot Logical Unit device CopyDisk0001 has been created.
B /09/12 14:20:40 SnapCopy a Snapshot Logical Unit device CopyDisk0001 has been created.
A /09/12 14:20: '' called by 'admin' ( ) on 'Navi_SnapCopyFeature' with result: Success (Started SnapView session successfully. Session name - sync-24_SRM-TEST-FAILOVER_session)
A /09/12 14:20:43 Bus1 Enc0 DskE a A logical unit has been enabled [ALU 2] ffff e
A /09/12 14:20:43 SnapCopy SnapView persistent session sync-24_SRM-TEST-FAILOVER_session has been started on LUN 24.

29 Test Failover – sequence of events [DR VNX]
B /09/12 14:20:43 Bus1 Enc0 DskE Unit Shutdown for Trespass [ALU 2] ffff e
B /09/12 14:20:43 SnapCopy SnapView persistent session sync-24_SRM-TEST-FAILOVER_session has been started on LUN 24.
A /09/12 14:20: 'Activate' called by 'admin' ( ) on 'SnapShot WWN: 60:06:01:60:54:50:2E:00:32:0B:08:80:1C:12:E2:11' with result: Success (Successfully activated snapshot LU: 60:06:01:60:54:50:2E:00:32:0B:08:80:1C:12:E2:11 (sync-24_SRM-TEST-FAILOVER_session))
A /09/12 14:20: 'ExecuteClientRequest' called by ' Navi User admin' ( ) on 'CLIFeature' (Result: Success). snapview -storagegroup -addsnapshot -gname SG_dellpr710-g.emcvmw.ctc -hlu 10 -snapshotname sync-24_SRM-TEST-FAILOVER -compatibilitymode called by 'admin'
A /09/12 14:20:55 RemoteMirror MirrorView quiesce LU request.
A /09/12 14:20:55 RemoteMirror MirrorView rebind request for LUN b9502e00:fe606cb30d03e211.
A /09/12 14:20:55 SnapCopy SnapView has been bound to device f.
A /09/12 14:20:55 RemoteMirror RM_ADMIN_INFO_WILL_REBIND the object.
B /09/12 14:20:55 RemoteMirror Quiesce request from peer SP.
B /09/12 14:20:55 RemoteMirror Rebind request from peer SP for LUN b9502e00:fe606cb30d03e211.
B /09/12 14:20:55 SnapCopy SnapView has been bound to device Disk0002.
A /09/12 14:20: 'Create a SnapShot LU' called by 'admin' ( ) on 'Navi_SnapCopyFeature' with result: Success (Successfully created SnapShot LU.)
A /09/12 14:20:56 SnapCopy a Snapshot Logical Unit device CopyDisk0002 has been created.
B /09/12 14:20:56 SnapCopy a Snapshot Logical Unit device CopyDisk0002 has been created.
A /09/12 14:20: '' called by 'admin' ( ) on 'Navi_SnapCopyFeature' with result: Success (Started SnapView session successfully. Session name - sync-0_SRM-TEST-FAILOVER_session)
A /09/12 14:20:59 SnapCopy SnapView persistent session sync-0_SRM-TEST-FAILOVER_session has been started on LUN 0.
A /09/12 14:20:59 Bus1 Enc0 DskE a A logical unit has been enabled [ALU 3] ffff e
B /09/12 14:20:59 Bus1 Enc0 DskE Unit Shutdown for Trespass [ALU 3] ffff e
B /09/12 14:20:59 SnapCopy SnapView persistent session sync-0_SRM-TEST-FAILOVER_session has been started on LUN 0.
A /09/12 14:21: 'Activate' called by 'admin' ( ) on 'SnapShot WWN: 60:06:01:60:54:50:2E:00:54:39:77:89:1C:12:E2:11' with result: Success (Successfully activated snapshot LU: 60:06:01:60:54:50:2E:00:54:39:77:89:1C:12:E2:11 (sync-0_SRM-TEST-FAILOVER_session))
A /09/12 14:21: 'ExecuteClientRequest' called by ' Navi User admin' ( ) on 'CLIFeature' (Result: Success). snapview -storagegroup -addsnapshot -gname SG_dellpr710-g.emcvmw.ctc -hlu 11 -snapshotname sync-0_SRM-TEST-FAILOVER -compatibilitymode called by 'admin'
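The `-addsnapshot` audit lines in this sequence record which test-failover snapshot was presented to the DR storage group, and at which host LUN (HLU). The sketch below pulls those triples out of SP log text. It assumes each event sits on its own line, as in the SP log file itself. The function name `snapshot_hlus` is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Matches the CLI audit text, e.g.
# snapview -storagegroup -addsnapshot -gname SG_x -hlu 9 -snapshotname snap_y
ADD_RE = re.compile(
    r"snapview -storagegroup -addsnapshot -gname (?P<group>\S+) "
    r"-hlu (?P<hlu>\d+) -snapshotname (?P<snap>\S+)"
)


def snapshot_hlus(lines: Iterable[str]) -> List[Tuple[str, int, str]]:
    """Return (storage group, host LUN, snapshot name) for each -addsnapshot event."""
    found: List[Tuple[str, int, str]] = []
    for line in lines:
        m = ADD_RE.search(line)
        if m:
            found.append((m.group("group"), int(m.group("hlu")), m.group("snap")))
    return found
```

In this run, the three snapshots land on HLUs 9, 10 and 11 of SG_dellpr710-g.emcvmw.ctc. These should line up with the LUN numbers the DR ESX host reports, for example `lun 11` in the LVM messages on slide 27.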

30 Test Failover – sequence of events [DEMO]

31 Cleanup – sequence of events [PROD]
Recovery Step / Result / Step Started / Step Completed / Execution Time
1. Power Off Test VMs at Recovery Site: Success :46:24 (UTC 0) :46:30 (UTC 0)
Power Off :46:25 (UTC 0) Reset Storage Power Off :46:27 (UTC 0) Reset Storage
2. Resume Non-critical VMs at Recovery Site: Inactive
3. Discard Test Data and Reset Storage: :47:25 (UTC 0)
3.1. Protection Group test7
Device "Mirror of dellpr710-c.w2k8.emcvm...":
Device "Mirror of dellpr710-c.w2k8.emcvmw.ctc RDM-23":
Device "Mirror of dellpr710-c.w2k8.emcvmw.ctc SRM_VMs":

32 Cleanup – sequence of events [PROD ESX]
- Some fairly serious-looking errors appear in the vmkernel log on the prod ESX host; these can be ignored:
)WARNING: VMW_SATP_LIB_CX: satp_lib_cx_otherSPIsHung:338:Path "vmhba2:C0:T1:L3" Peer SP is hung.
)WARNING: VMW_SATP_LIB_CX: satp_lib_cx_otherSPIsHung:338:Path "vmhba3:C0:T0:L3" Peer SP is hung.
)ALERT: NMP: vmk_NmpVerifyPathUID:1166:The physical media represented by device naa da02e0012ce6a8f930de211 (path vmhba3:C0:T1:L9) has changed. If this is a data LUN, this is a critical error. Detecte
)ALERT: NMP: vmk_NmpVerifyPathUID:1166:The physical media represented by device naa da02e0012ce6a8f930de211 (path vmhba2:C0:T0:L9) has changed. If this is a data LUN, this is a critical error. Detecte
)NMP: nmp_DeviceUpdatePathStates:547: Activated path "vmhba2:C0:T1:L9" for NMP device "naa da02e0012ce6a8f930de211".
- Watch out for messages like this; customers could open cases based on these errors alone…

33 Cleanup – sequence of events [DR ESX]

34 Cleanup – sequence of events [DR VNX]
A /09/12 14:46: 'storagegroup' called by ' Navi User admin' ( ) with result: Success (Navisphere CLI command: ' storagegroup -removesnapshot -o -gname SG_dellpr710-g.emcvmw.ctc -snapshotname async-25_SRM-TEST-FAILOVER ')
A /09/12 14:46: 'Stop' called by 'admin' ( ) on 'Session Name: async-25_SRM-TEST-FAILOVER_session' with result: Success (Deactivated snapshot LU successfully: 60:06:01:60:54:50:2E:00:92:7D:58:76:1C:12:E2:11 (async-25_SRM-TEST-FAILOVER_session)Stopped session su
A /09/12 14:46:50 SnapCopy SnapView session async-25_SRM-TEST-FAILOVER_session has been stopped on LUN 25 with status of 0.
B /09/12 14:46:50 SnapCopy SnapView session async-25_SRM-TEST-FAILOVER_session has been stopped on LUN 25 with status of 0.
A /09/12 14:46:52 SnapCopy b Snapshot Logical Unit device CopyDisk0000 has been removed.
A /09/12 14:46:52 NaviCimom Failing Command: Set LUN.
B /09/12 14:46:52 SnapCopy b Snapshot Logical Unit device CopyDisk0000 has been removed.
A /09/12 14:46: 'Destroy a SnapShot' called by 'admin' ( ) on 'SnapShot WWN: 60:06:01:60:54:50:2E:00:92:7D:58:76:1C:12:E2:11' with result: Success (Destroy snapshot successfully: 60:06:01:60:54:50:2E:00:92:7D:58:76:1C:12:E2:11)
B /09/12 14:46:53 NaviCimom Failing Command: Set LUN.
A /09/12 14:47: 'storagegroup' called by ' Navi User admin' ( ) with result: Success (Navisphere CLI command: ' storagegroup -removesnapshot -o -gname SG_dellpr710-g.emcvmw.ctc -snapshotname sync-24_SRM-TEST-FAILOVER ')
A /09/12 14:47: 'Stop' called by 'admin' ( ) on 'Session Name: sync-24_SRM-TEST-FAILOVER_session' with result: Success (Deactivated snapshot LU successfully: 60:06:01:60:54:50:2E:00:32:0B:08:80:1C:12:E2:11 (sync-24_SRM-TEST-FAILOVER_session)Stopped session succ
A /09/12 14:47:01 Bus1 Enc0 DskE Unit Shutdown for Trespass [ALU 2] ffff e
A /09/12 14:47:01 SnapCopy SnapView session sync-24_SRM-TEST-FAILOVER_session has been stopped on LUN 24 with status of 0.
B /09/12 14:47:01 SnapCopy SnapView session sync-24_SRM-TEST-FAILOVER_session has been stopped on LUN 24 with status of 0.
B /09/12 14:47:01 Bus1 Enc0 DskE a A logical unit has been enabled [ALU 2] ffff e
A /09/12 14:47:02 SnapCopy b Snapshot Logical Unit device CopyDisk0001 has been removed.
B /09/12 14:47:02 SnapCopy b Snapshot Logical Unit device CopyDisk0001 has been removed.
A /09/12 14:47: 'Destroy a SnapShot' called by 'admin' ( ) on 'SnapShot WWN: 60:06:01:60:54:50:2E:00:32:0B:08:80:1C:12:E2:11' with result: Success (Destroy snapshot successfully: 60:06:01:60:54:50:2E:00:32:0B:08:80:1C:12:E2:11)
A /09/12 14:47:03 RemoteMirror MirrorView quiesce LU request.
A /09/12 14:47:03 RemoteMirror RM_ADMIN_INFO_WILL_REBIND the object.
A /09/12 14:47:03 RemoteMirror MirrorView rebind request for LUN b9502e00:a4cc7fc81507e211.

35 Cleanup – sequence of events [DR VNX]
B /09/12 14:47:03 RemoteMirror Quiesce request from peer SP.
B /09/12 14:47:03 RemoteMirror Rebind request from peer SP for LUN b9502e00:a4cc7fc81507e211.
B /09/12 14:47:03 SnapCopy SnapView has been unbound from device Disk0001.
A /09/12 14:47: 'storagegroup' called by ' Navi User admin' ( ) with result: Success (Navisphere CLI command: ' storagegroup -removesnapshot -o -gname SG_dellpr710-g.emcvmw.ctc -snapshotname sync-0_SRM-TEST-FAILOVER ')
A /09/12 14:47: 'Stop' called by 'admin' ( ) on 'Session Name: sync-0_SRM-TEST-FAILOVER_session' with result: Success (Deactivated snapshot LU successfully: 60:06:01:60:54:50:2E:00:54:39:77:89:1C:12:E2:11 (sync-0_SRM-TEST-FAILOVER_session)Stopped session succes
A /09/12 14:47:10 Bus1 Enc0 DskE Unit Shutdown for Trespass [ALU 3] ffff e
A /09/12 14:47:10 SnapCopy SnapView session sync-0_SRM-TEST-FAILOVER_session has been stopped on LUN 0 with status of 0.
B /09/12 14:47:10 SnapCopy SnapView session sync-0_SRM-TEST-FAILOVER_session has been stopped on LUN 0 with status of 0.
B /09/12 14:47:10 Bus1 Enc0 DskE a A logical unit has been enabled [ALU 3] ffff e
A /09/12 14:47: 'Destroy a SnapShot' called by 'admin' ( ) on 'SnapShot WWN: 60:06:01:60:54:50:2E:00:54:39:77:89:1C:12:E2:11' with result: Success (Destroy snapshot successfully: 60:06:01:60:54:50:2E:00:54:39:77:89:1C:12:E2:11)
A /09/12 14:47:12 RemoteMirror MirrorView quiesce LU request.
A /09/12 14:47:12 RemoteMirror RM_ADMIN_INFO_WILL_REBIND the object.
A /09/12 14:47:12 RemoteMirror MirrorView rebind request for LUN b9502e00:fe606cb30d03e211.
A /09/12 14:47:12 SnapCopy b Snapshot Logical Unit device CopyDisk0002 has been removed.
B /09/12 14:47:12 SnapCopy b Snapshot Logical Unit device CopyDisk0002 has been removed.
B /09/12 14:47:12 RemoteMirror Rebind request from peer SP for LUN b9502e00:fe606cb30d03e211.
B /09/12 14:47:12 RemoteMirror Quiesce request from peer SP.
B /09/12 14:47:12 SnapCopy SnapView has been unbound from device Disk0002.
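Cleanup should stop every SnapView session that the test failover started. The sketch below cross-checks the "has been started on LUN" messages against the "has been stopped on LUN … with status of N" messages from the SP logs. It assumes one event per line, as in the SP log file. It treats a missing stop, or a non-zero stop status, as something to investigate. The function name `cleanup_problems` is hypothetical.

```python
import re
from typing import Dict, Iterable, List

START_RE = re.compile(r"SnapView persistent session (\S+) has been started on LUN (\d+)\.")
STOP_RE = re.compile(r"SnapView session (\S+) has been stopped on LUN (\d+) with status of (\d+)\.")


def cleanup_problems(lines: Iterable[str]) -> List[str]:
    """List sessions that were started but never stopped, or stopped with a non-zero status."""
    started: Dict[str, int] = {}  # session name -> LUN
    stopped: Dict[str, int] = {}  # session name -> stop status
    for line in lines:
        if m := START_RE.search(line):
            started[m.group(1)] = int(m.group(2))
        elif m := STOP_RE.search(line):
            stopped[m.group(1)] = int(m.group(3))
    problems: List[str] = []
    for session, lun in sorted(started.items()):
        if session not in stopped:
            problems.append(f"{session} (LUN {lun}) was never stopped")
        elif stopped[session] != 0:
            problems.append(f"{session} (LUN {lun}) stopped with status {stopped[session]}")
    return problems
```

Feed it the SP log lines from the start of the test through the end of cleanup. For the run shown on these slides, it should return an empty list, because all three sessions stop with status 0.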

36 Cleanup – sequence of events [PROD]

37 Failover - DEMO

38 Troubleshooting: Obtaining the correct logs
Be sure to capture the SRM and SRA logs by following the VMware KB "Export system Logs". Please complete these actions on both sites!

39 Troubleshooting: Obtaining the correct logs
If the issue is related to a Test Failover or an actual Failover, the failed Recovery Plan export log is also invaluable for troubleshooting. To generate the export for the failed Recovery Plan: in the left pane, click Recovery Plans and select the Recovery Plan that had the issue. Select the plan name that shows an Error in the Result column. On that plan name, click the Export action to generate the report for the failed Test Failover or actual Failover. Save the file to your desktop and upload it together with the SRM system logs.

40 Troubleshooting: Obtaining the correct logs
The exported information will look like this. Take note of the timestamps, as these are what we will use to search through the SPCOLLECT. The errors listed here are extremely useful in the actual diagnosis of the issue.

41 Troubleshooting: log files of interest:
There are 2 main folders of interest within the exported log bundles. The first is, surprisingly enough, the Logs folder: it contains all the activity from the SRM application on that particular site. Extract all .gz archives in case the errors you are searching for happened a while back. The main file of interest in this folder is called "vmware-dr-XX". Sort by date and review the most recent one, or search for the timestamp obtained from the exported .html page described on slide 40.
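The extract-and-sort step can be scripted. The Python sketch below assumes that rotated SRM logs in the bundle's Logs folder are named like `vmware-dr-NN.log.gz`; this naming is an assumption. It decompresses each archive next to itself, then copies the archive's timestamps onto the extracted file, so that a sort by date roughly reflects when each log was rotated rather than when you unpacked it.

```python
import gzip
import os
import shutil
from pathlib import Path
from typing import List


def extract_gz_logs(folder: Path) -> List[Path]:
    """Decompress every *.gz in `folder` alongside the archive; return the outputs."""
    extracted: List[Path] = []
    for gz_path in sorted(folder.glob("*.gz")):
        target = gz_path.with_suffix("")  # vmware-dr-3.log.gz -> vmware-dr-3.log
        if not target.exists():
            with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            # Keep the archive's timestamps so sorting by date is still meaningful.
            st = gz_path.stat()
            os.utime(target, (st.st_atime, st.st_mtime))
        extracted.append(target)
    return extracted


def vmware_dr_logs_newest_first(folder: Path) -> List[Path]:
    """Uncompressed vmware-dr* files, most recently modified first."""
    logs = [p for p in folder.glob("vmware-dr*") if p.suffix != ".gz"]
    return sorted(logs, key=lambda p: p.stat().st_mtime, reverse=True)
```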

42 Troubleshooting: log files of interest:
Please note that the dates in the SRM logs are in the following format: T09:10: :00
The dates in the exported .html page are in: :10:29 (UTC 0)
So adjust accordingly when searching for errors across logs. In my example I'll search through the SRM logs with: T09
This is a good starting point. For the Linux heads, this is what I'm using to make the logs more human-readable:
grep " T09" vmware-dr-3* | grep -v "<" | less
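For anyone not on Linux, the UTC adjustment and the grep pipeline can also be expressed in Python. The timestamp layouts below are assumptions, because the examples on this slide are truncated. The report time is taken as `YYYY-MM-DD HH:MM:SS` in UTC. The SRM log is assumed to use ISO-style `YYYY-MM-DDTHH…` local times at a known UTC offset. Both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Iterator


def report_time_to_log_hour(report_utc: str, log_utc_offset_hours: int) -> str:
    """Convert a report time (assumed 'YYYY-MM-DD HH:MM:SS', UTC) into the
    'YYYY-MM-DDTHH' hour prefix of a log written at the given UTC offset."""
    utc = datetime.strptime(report_utc, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    local = utc.astimezone(timezone(timedelta(hours=log_utc_offset_hours)))
    return local.strftime("%Y-%m-%dT%H")


def grep_srm_log(lines: Iterable[str], hour_prefix: str) -> Iterator[str]:
    """Keep lines containing hour_prefix and drop lines containing '<',
    mirroring: grep PREFIX vmware-dr-* | grep -v '<'"""
    for line in lines:
        if hour_prefix in line and "<" not in line:
            yield line.rstrip("\n")
```

For example, a report time of 09:10:29 UTC, searched in a log written at UTC+1, gives the prefix `…T10`. Note that a time shortly before midnight can roll over into the next day's date.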

43 Troubleshooting: log files of interest:
Most of the information in these vmware-dr-XX logs is really of more interest to VMware than to EMC, as it is just very verbose logging of the SRM application and its database interaction. No harm in having a peek to see if anything jumps out, though. The log files we want to focus on next are the SRA logs, and there are a few different ones.
Location = srm-support\Logs\SRAs\EMC VNX SRA
sra_discoverArrays_ _
sra_discoverDevices_ _
sra_failover_ _
sra_prepareFailover_ _
sra_prepareReverseReplication_ _
sra_queryCapabilities_ _
sra_queryConnectionParameters_ _
sra_queryErrorDefinitions_ _
sra_queryInfo_ _
sra_queryReplicationSettings_ _
sra_queryStrings_ _
sra_querySyncStatus_ _
sra_reverseReplication_ _
sra_syncOnce_ _
sra_testFailoverStart_ _
sra_testFailoverStop_ _

44 Troubleshooting: log files of interest:
The timestamp format is similar here, so just take note of it. Get the timestamp from the .html page as before. The SRA log folder will be full of logs, so I'm just going to focus on the logs from Sept. The above error was received when trying to do a Test Failover, so we need to check the following log: sra_testFailoverStart_ _ log. Note the UTC time adjustment.

45 Troubleshooting: log files of interest:
Looking at the log file, we can pull out some very useful information. In [sra_testFailoverStart_ _ log], grep / search for: com.emc.mirrorview.platform.naviseccli.NaviseccliConnection
This shows the actual navi commands being issued by the SRA to the SP.

46 Troubleshooting: log files of interest:
Within the same log file, search / grep for: Command result:
This should give a clear indication of where the error lies. In this case, it looks like we have an issue with SnapView.
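Pairing each "Executing command:" line with the "Command result:" line that follows it makes it easy to see which navi call failed. The sketch below assumes each result fits on one line, as in the SRA log excerpts in this deck; multi-line naviseccli output would need the parser extended. `navi_calls` and `failed_calls` are hypothetical helper names. A call counts as failed if it has no result, has `Error:` in stdout, or has anything on stderr.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class NaviCall(NamedTuple):
    command: str
    result: Optional[str]  # None if no result line followed the command


EXEC_RE = re.compile(r"NaviseccliConnection\]: Executing command: (.*)$")
RESULT_RE = re.compile(r"NaviseccliConnection\]: Command result: (.*)$")


def navi_calls(lines: Iterable[str]) -> List[NaviCall]:
    """Pair each executed naviseccli command with the result line that follows it."""
    calls: List[NaviCall] = []
    pending: Optional[str] = None
    for line in lines:
        line = line.rstrip("\n")
        if m := EXEC_RE.search(line):
            if pending is not None:  # previous command never logged a result
                calls.append(NaviCall(pending, None))
            pending = m.group(1)
        elif (m := RESULT_RE.search(line)) and pending is not None:
            calls.append(NaviCall(pending, m.group(1)))
            pending = None
    if pending is not None:
        calls.append(NaviCall(pending, None))
    return calls


def failed_calls(calls: List[NaviCall]) -> List[NaviCall]:
    """Calls with no result, an 'Error:' in stdout, or anything on stderr."""
    return [
        c for c in calls
        if c.result is None or "stdout(Error:" in c.result or not c.result.endswith("stderr()")
    ]
```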

47 Troubleshooting: log files of interest:
Switch over to the SPCOLLECTs for both sites, and grep out any messages related to SnapCopy. On the DR site, we can see:
/cygdrive/c/Users/Dave/Documents.backup/Logs/SRM_LOGS_PPTX/Pandora $ grep SnapCopy "TRiiAGE_full_SPlogs.txt"
B /02/12 07:54:35 NaviCimom b Failing Command: K10SnapCopyAdmin DBid 0 Op 1046.
A /02/12 07:54:43 NaviCimom b Failing Command: K10SnapCopyAdmin DBid 0 Op 1046.
A /02/12 07:54: 'Create a SnapShot LU' called by 'admin' ( ) on 'Navi_SnapCopyFeature' with result: Failure (Could not create SnapShot LU.. [0x B] A SnapView snapshot already exists with the specified name (0x b))
A /02/12 07:54:45 SnapCopy You must add LUNs with adequate capacity to the Reserved LUN Pool before you can use this feature.
A /02/12 07:54: '' called by 'admin' ( ) on 'Navi_SnapCopyFeature' with result: Failure (Could not start SnapView session. Session name - sync-0_SRM-TEST-FAILOVER_session. [0x ] You must add LUNs with adequate capacity to the Reserved LUN Pool before you
A /02/12 07:54:46 NaviCimom Failing Command: K10SnapCopyAdmin DBid 0 Op 1038.
This indicates that there is no Reserved LUN Pool set up, as described on slide 14.
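Without Cygwin, the same SnapCopy grep can be done in Python. The sketch below also separates out lines that look like failures. The marker strings ("Failure", "Failing Command", "You must add LUNs") are taken from the output above. Treat the split as a triage aid, not a verdict. `snapcopy_summary` is a hypothetical name.

```python
from pathlib import Path
from typing import Dict, List

FAILURE_MARKERS = ("Failure", "Failing Command", "You must add LUNs")


def snapcopy_summary(log_path: Path) -> Dict[str, List[str]]:
    """Equivalent of `grep SnapCopy <file>`, split into likely failures and the rest."""
    summary: Dict[str, List[str]] = {"failures": [], "other": []}
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if "SnapCopy" not in line:  # also matches K10SnapCopyAdmin, as grep would
                continue
            line = line.rstrip("\n")
            if any(marker in line for marker in FAILURE_MARKERS):
                summary["failures"].append(line)
            else:
                summary["other"].append(line)
    return summary
```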

48 Troubleshooting: Logs needed - Recap.
So for all SRM / SRA cases you will need the following logs:
- SPCOLLECTs from both sites
- SRM logs from both sites
- SRA logs from both sites
- "Recovery Plan Export Log.html", as explained on slide 39
Seriously, don't proceed until you have everything listed above.

49 Troubleshooting: Workflow
OK, so for every SRM / SRA case that comes in, the following workflow should apply towards resolution of the case:
1. Collect logs.
2. Check MirrorView and confirm it is actually working.
3. Have the customer reconfirm that DNS and IP connectivity are OK: all hosts should be DNS-resolvable on both sites, and all hosts / SPs should have IP connectivity on the same VLAN (all hosts / SPs should be able to ping each other).
4. Confirm the software requirements listed on slide 8.
5. Check the error reported in Recovery Plan Export Log.html.
6. Search TRiiAGE_full_SPlogs for any errors at the time reported in Recovery Plan Export Log.html.

50 Log error / message examples
In this section we provide examples of errors and informative messages that may assist in troubleshooting your issue.

51 Log error / message examples
Search for errors returned from the Navi commands within the SRA logs. In particular, look for: Command result: stdout(Error:
Search the whole folder of SRA logs, as you will get hits in different files depending on the issue you are having. Some examples:
[sra_testFailoverStart_ _ log]
:40:31,693 [com.emc.mirrorview.platform.snapshot.HashMapSnapviewSnapshotRepository]: Caching SnapView snapshot with name sync-0_SRM-TEST-FAILOVER
:40:31,693 [com.emc.mirrorview.platform.snapshot.session.SnapviewSessionServiceImpl]: Starting SnapView session with name: sync-0_SRM-TEST-FAILOVER_session, for snapshot: sync-0_SRM-TEST-FAILOVER
:40:31,693 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Executing command: snapview -startsession "sync-0_SRM-TEST-FAILOVER_session" -snapshotname "sync-0_SRM-TEST-FAILOVER" -persistence
:40:33,003 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Command result: stdout(Error: snapview command failed You must add LUNs with adequate capacity to the Reserved LUN Pool before you can use this feature. (0x )), stderr()
:40:33,003 [com.emc.mirrorview.platform.snapshot.session.SnapviewSessionServiceImpl]: Retrieving info for SnapView session with name sync-0_SRM-TEST-FAILOVER_session
:40:33,003 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Executing command: snapview -listsessions -name "sync-0_SRM-TEST-FAILOVER_session"
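Searching the whole SRA log folder for `Command result: stdout(Error:` can be scripted as below. The sketch assumes the SRA logs are files whose names start with `sra_`, somewhere under the folder you pass in (for example the `EMC VNX SRA` folder listed on slide 43). It yields the file name, line number and matching line for each hit. `search_sra_logs` is a hypothetical name.

```python
from pathlib import Path
from typing import Iterator, Tuple

NEEDLE = "Command result: stdout(Error:"


def search_sra_logs(folder: Path, needle: str = NEEDLE) -> Iterator[Tuple[str, int, str]]:
    """Yield (file name, line number, line) for every line containing `needle`."""
    for path in sorted(folder.rglob("sra_*")):
        if not path.is_file():
            continue
        with open(path, encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if needle in line:
                    yield path.name, lineno, line.rstrip("\n")
```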

52 Log error / message examples
sra_discoverArrays_ _ log
:15:51,457 [com.emc.sra.SraController]: Building SRM command response...
:15:51,457 [com.emc.sra.ResponseBuilder]: Building discoverArrays response...
:15:51,473 [com.emc.sra.mirrorview.MirrorviewCommands]: MirrorView Enabler Version:
:15:51,473 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Executing command: arrayname
:15:51,504 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Command result: stdout(), stderr('naviseccli' is not recognized as an internal or external command, operable program or batch file.)
:15:51,504 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Unknown error occurred while opening naviseccli connection.
:15:51,504 [com.emc.sra.mirrorview.MirrorviewCommands]: Unable to connect using SPA, trying SPB...
:15:51,504 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Executing command: arrayname
:15:51,520 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Command result: stdout(), stderr('naviseccli' is not recognized as an internal or external command,
:15:51,520 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Unknown error occurred while opening naviseccli connection.
:15:51,520 [com.emc.sra.ResponseBuilder]: Unable to get SRA Enabler for this connection info
Unable to get SRA Enabler for this connection info
com.emc.sra.ResponseBuilder.getEnabler(ResponseBuilder.java:346)
From the above, it would seem that naviseccli is not installed properly on the SRM host; check the PATH!

53 Log error / message examples
If MirrorView is working, it would look like this in sra_discoverDevices_ _ log:
:20:32,823 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Executing command: mirror -sync -info -systems
:20:34,102 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Command result: stdout(Remote systems that can be enabled for mirroring: Remote systems that are enabled for mirroring: Array UID: 50:06:01:60:C7:20:0A:2D Status: Enabled on both SPs), stderr()
:20:34,102 [com.emc.sra.response.ReplicatedDevicesBuilder]: Attempted discovery of peer array:50:06:01:60:C7:20:0A:2Dfailed
:20:34,102 [com.emc.mirrorview.platform.mirror.MirrorServiceImpl]: ************* SYNC MIRRORS ***************
:20:34,102 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Executing command: mirror -sync -list
:20:35,381 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Command result: stdout(MirrorView Name: Mirror of dellpr710-c.w2k8.emcvmw.ctc Datastore_1
MirrorView Description:
MirrorView UID: 50:06:01:60:BE:A0:39:93:03:00:00:00:00:00:00:00
Logical Unit Numbers: 0
Remote Mirror Status: Mirrored
MirrorView State: Active
MirrorView Faulted: NO
MirrorView Transitioning: NO
Quiesce Threshold: 60
Minimum number of images required: 0
Image Size:
Image Count: 2
Write Intent Log Used: YES
Images:
Image UID: 50:06:01:60:BE:A0:39:93
Is Image Primary: YES
Logical Unit UID: 60:06:01:60:9D:A0:2E:00:52:49:31:9F:C9:D7:E1:11
Image Condition: Primary Image
Preferred SP: A
Image UID: 50:06:01:60:C7:20:0A:2D
Is Image Primary: NO
Logical Unit UID: 60:06:01:60:B9:50:2E:00:FA:DE:7D:16:AB:F1:E1:11
Image State: Synchronized
Image Condition: Normal
Recovery Policy: Manual
Synchronization Rate: Medium
Image Faulted: NO
Image Transitioning: NO
Synchronizing Progress(%): 100), stderr()
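To check the mirror state programmatically rather than by eye, the mirror listing can be split into one record per image. This is a sketch with two assumptions. First, the output has one `Key: Value` pair per line, with each image starting at `Image UID:`. Second, the set of "healthy" secondary states defaults to Synchronized and Consistent; Consistent is included because it is assumed to be a normal state for asynchronous mirrors. Both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def parse_mirror_images(text: str) -> List[Dict[str, str]]:
    """One dict per image, tagged with the name of the mirror it belongs to."""
    images: List[Dict[str, str]] = []
    mirror: Optional[str] = None
    in_image = False
    for raw in text.splitlines():
        key, sep, value = raw.partition(":")  # split on the first colon only
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "MirrorView Name":  # a new mirror starts
            mirror, in_image = value, False
        elif key == "Image UID":  # a new image within the current mirror
            images.append({"Mirror": mirror or ""})
            in_image = True
        if in_image:
            images[-1][key] = value
    return images


def unhealthy_secondaries(
    text: str, healthy: Iterable[str] = ("Synchronized", "Consistent")
) -> List[str]:
    """Describe every secondary image that is faulted or not in a healthy state."""
    ok_states = set(healthy)
    problems: List[str] = []
    for img in parse_mirror_images(text):
        if img.get("Is Image Primary") == "YES":
            continue
        state, faulted = img.get("Image State"), img.get("Image Faulted")
        if state not in ok_states or faulted != "NO":
            problems.append(f"{img.get('Image UID')}: state={state}, faulted={faulted}")
    return problems
```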

54 Log error / message examples
sra_discoverDevices_ _ log
:55:13,486 [com.emc.mirrorview.platform.snapshot.SnapviewSnapshotServiceImpl]: Retrieving SnapView snapshot information...
:55:13,486 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Executing command: snapview -listsnapshots
:55:13,861 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Command result: stdout(This version of Core Software does not support Snapview), stderr()
:55:13,861 [com.emc.mirrorview.platform.snapshot.HashMapSnapviewSnapshotRepository]: Caching SnapView snapshot with name null
:55:13,861 [com.emc.sra.SraController]: Writing XML response...
In the above example, I did not have the correct SnapView enabler installed on the VNX; there is a new one for INYO. See slide 8.

55 Log error / message examples
sra_testFailoverStart_ _ log
:43:58,722 [com.emc.mirrorview.platform.snapshot.SnapviewSnapshotServiceImpl]: Retrieving SnapView snapshot information...
:43:58,722 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Executing command: snapview -listsnapshots
:43:59,923 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Command result: stdout(SnapView logical unit name: sync-0_SRM-TEST-FAILOVER SnapView logical unit ID: 60:06:01:60:54:50:2E:00:78:4E:71:B9:BA:F2:E1:11 Target Logical Unit: 0 State: Inactive), stderr()
:43:59,923 [com.emc.mirrorview.platform.snapshot.HashMapSnapviewSnapshotRepository]: Caching SnapView snapshot with name sync-0_SRM-TEST-FAILOVER
:43:59,923 [com.emc.mirrorview.platform.snapshot.SnapviewSnapshotServiceImpl]: Creating SnapView snapshot with name: sync-0_SRM-TEST-FAILOVER , of LUN: 0
:43:59,923 [com.emc.mirrorview.platform.snapshot.SnapviewSnapshotServiceImpl]: Searching for current SP owner of lun: 0
:43:59,923 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Executing command: snapview -createsnapshot 0 -snapshotname "sync-0_SRM-TEST-FAILOVER"
:44:02,029 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Command result: stdout(Error: snapview command failed A SnapView snapshot already exists with the specified name (0x b)), stderr()
:44:02,029 [com.emc.mirrorview.platform.snapshot.session.SnapviewSessionServiceImpl]: Starting SnapView session with name: sync-0_SRM-TEST-FAILOVER_session, for snapshot: sync-0_SRM-TEST-FAILOVER
:44:02,029 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Executing command: snapview -startsession "sync-0_SRM-TEST-FAILOVER_session" -snapshotname "sync-0_SRM-TEST-FAILOVER" -persistence
:44:03,355 [com.emc.mirrorview.platform.naviseccli.NaviseccliConnection]: Command result: stdout(Error: snapview command failed You must add LUNs with adequate capacity to the Reserved LUN Pool before you can use this feature. (0x )), stderr()
Both of the above errors were experienced when I did not have the Reserved LUN Pool set up.
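The error strings shown on slides 52, 54 and 55, and in the SPCOLLECT output on slide 47, map onto a handful of root causes. Below is a hypothetical lookup table built only from those examples. It is a first-pass triage aid for a log line, not an exhaustive or official error catalogue.

```python
from typing import List, Tuple

# (symptom substring seen in this deck's logs, cause the deck attributes to it)
KNOWN_ERRORS: List[Tuple[str, str]] = [
    ("'naviseccli' is not recognized",
     "naviseccli not installed / not on the PATH of the SRM host"),
    ("does not support Snapview",
     "SnapView enabler missing or wrong version on the array"),
    ("Reserved LUN Pool",
     "Reserved LUN Pool not configured (needed for test failover snapshots)"),
    ("A SnapView snapshot already exists with the specified name",
     "a snapshot with the SRA's name already exists; in this deck it was seen "
     "when the Reserved LUN Pool was not set up"),
]


def diagnose(line: str) -> List[str]:
    """Return every known cause whose symptom text appears in `line`."""
    return [cause for symptom, cause in KNOWN_ERRORS if symptom in line]
```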

56 Log error / message examples [ESX]
From the ESX side, we will see errors in the vmkernel log during an HBA rescan of the MirrorView devices. This is expected behaviour:
T13:11:19.187Z cpu9:2057)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x28 (0x ba440) to dev "naa b9502e00a4cc7fc81507e211" on path "vmhba5:C0:T0:L5" Failed: H:0x0 D:0x2 P:0x2 Possible sense data: 0x5 0x25 0x1.Act:NONE
T13:11:19.187Z cpu9:2057)ScsiDeviceIO: 2316: Cmd(0x ba440) 0x28, CmdSN 0x5c5e6 to dev "naa b9502e00a4cc7fc81507e211" failed H:0x0 D:0x2 P:0x2 Possible sense data: 0x5 0x25 0x1.
T13:11:19.187Z cpu14:2833)Partition: 484: Read of GPT header failed on "naa b9502e00a4cc7fc81507e211": I/O error
T13:11:19.188Z cpu9:2057)ScsiDeviceIO: 2316: Cmd(0x ba440) 0x28, CmdSN 0x5c5e7 to dev "naa b9502e00a4cc7fc81507e211" failed H:0x0
T13:11:19.188Z cpu14:2833)WARNING: Partition: 944: Partition table read from device naa b9502e00a4cc7fc81507e211 failed: I/O error

57 Log error / message examples [ESX]
Taking a closer look at those SCSI sense codes: H:0x0 D:0x2 P:0x2 Possible sense data: 0x5 0x25 0x1
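These codes can be decoded by hand:
- D:0x2 is the SCSI status CHECK CONDITION.
- Sense key 0x5 is ILLEGAL REQUEST.
- ASC 0x25 is LOGICAL UNIT NOT SUPPORTED.
- Cmd 0x28 is a READ(10).

Together, this is consistent with the host probing a MirrorView secondary image that it is not allowed to read. The sketch below does the same decoding for any such vmkernel line. Only a small subset of the SCSI tables is included. H:0x0 is read as "no host/HBA error". The VMware plug-in status (P:) and the ASCQ are shown raw rather than interpreted.

```python
import re
from typing import Dict

# Small subset of the SCSI status and sense-key tables; extend as needed.
DEVICE_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY", 0x18: "RESERVATION CONFLICT"}
SENSE_KEYS = {0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY", 0x3: "MEDIUM ERROR",
              0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST", 0x6: "UNIT ATTENTION"}
ASC = {0x25: "LOGICAL UNIT NOT SUPPORTED"}

SENSE_RE = re.compile(
    r"H:0x(?P<h>[0-9a-f]+) D:0x(?P<d>[0-9a-f]+) P:0x(?P<p>[0-9a-f]+) "
    r"Possible sense data: 0x(?P<key>[0-9a-f]+) 0x(?P<asc>[0-9a-f]+) 0x(?P<ascq>[0-9a-f]+)",
    re.IGNORECASE,
)


def decode(line: str) -> Dict[str, str]:
    """Decode the H/D/P status and sense data in a vmkernel line; {} if absent."""
    m = SENSE_RE.search(line)
    if not m:
        return {}
    h, d, key, asc, ascq = (int(m.group(g), 16) for g in ("h", "d", "key", "asc", "ascq"))
    return {
        "host": "OK" if h == 0 else f"host status 0x{h:x}",
        "device": DEVICE_STATUS.get(d, f"0x{d:x}"),
        "plugin": f"0x{m.group('p')}",  # VMware plug-in status, left undecoded
        "sense_key": SENSE_KEYS.get(key, f"0x{key:x}"),
        "asc_ascq": f"{ASC.get(asc, f'0x{asc:x}')} (ASCQ 0x{ascq:x})",
    }
```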

58 Questions
