How CCR and SCR provide High Availability in Exchange Server 2007 SP1

Scott Schnoll, Principal Technical Writer, Exchange Server, Microsoft Corporation
© 2007 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Ilse Van Criekinge, Exchange MVP, Trainer & Consultant, Microsoft Unified Communications, Global Knowledge

Agenda
  Mailbox Server High Availability Options
  CCR and SCR: Better Together
  Why CCR? Why not SCC?
  Continuous Replication Demystified
  Troubleshooting Exchange Clusters and Continuous Replication
  Known Issues

Mailbox Server High Availability Options

Local Continuous Replication (LCR)

Single Copy Cluster (SCC)

Cluster Continuous Replication (CCR)

Standby Continuous Replication
SCR Sources: CCR, Standalone, SCC
SCR Targets: Standalone Mailbox Server (w/o LCR), Standby Cluster with Passive Mailbox Role

CCR and SCR: Better Together

CCR and SCR: Better Together
CCR provides high availability for Mailbox data and services within the datacenter.
SCR replicates data remotely to provide site resilience for the Mailbox data.
[Diagram: Datacenter A and Datacenter B]

CCR across 2 Sites
[Diagram: Datacenter A and Datacenter B]

CCR local / SCR to remote Site
[Diagram: Datacenter A and Datacenter B]

CCR/SCR vs SCC/Sync – 2 sites
[Diagram: Datacenter A and Datacenter B, each holding DB and Logs; the SCC side also shows VSS clones and Q]
CCR:
  Log corruption detected immediately on replication at both targets
  Physical corruption: Setup /recovercms, play logs forward
SCC:
  Exchange Disaster Recovery or 3rd Party Failover
  On Site Failure in Primary Site, if corruption is not detected and corrected from a test failover, must recover from backup
  On full Storage or Site Failure in Primary Site, corruption is detected, must recover from backup
  Undetected physical corruption is carried into the VSS clones (diagram example: 1 month later, undetected physical corruption)

Why CCR? Why Not SCC?

Why CCR? Why not SCC?
Single Point of Failure
  CCR: None when stretched across sites or combined with SCR for site resiliency
  SCC: Data, Storage and Site single points of failure
    Potential for massive data loss on single failure:
      Storage device failures can lose collocated backups
      Hardware replication can propagate physical errors
      Storage failure requires activation of remote copy if one exists
    SCC requires two VSS clones plus a remote copy of data to achieve RPO equal to CCR
Simplicity
  CCR:
    Simple setup
    No special storage configuration required
    Built-in Site Resilience
    Same technology and redundancy model for intra- and inter-site protection
  SCC:
    Shared storage
    Storage configuration before and after forming cluster
    Complex storage stack: driver mgmt, Cluster WCL, switches, multipathing, queue depths
    Complex deployment to approach RTO/RPO of 1 CCR cluster

Why CCR? Why not SCC?
Backups
  CCR: Backups off the passive copy eliminate or reduce the backup window
  SCC: Backups must be off active
TCO
  CCR: Reduced TCO
    Cheaper hardware
    No special storage expertise required
    In-the-box solution
    Integrated management
    Single operations team
    Reduced backup cost
  SCC: Higher TCO
    Additional products needed to achieve equivalent combined RTO/RPO
    Separate management tools for HA operations may be required
    Higher-end servers and storage required
    Storage expertise needed
Large Mailboxes
  CCR: Great RTO/RPO, Simplicity, No Maintenance Window, Reduced TCO → improved support for larger mailboxes
  SCC: Higher TCO, long recovery times constrain mailbox size

Why CCR? Why not SCC?
Columns: Failure; Stretched CCR or CCR + SCR; SCC; SCC + SCR/3rd party replication + 2 VSS clones to approach combined RTO/RPO of 1 CCR cluster
RTO
  Server: ~ 2 minutes
  Data or LUN: 15 min – 1 hour
  Full Storage: ~ 15 min with synchronous replication; days with VSS clones only
  Site: ~ 2 minutes for Stretched CCR; 30-60 minutes for CCR + SCR
RPO
  0 for mail* appointment, contact, task, draft; 0 – uses same copy of data
  Physical Corrupt DB: Hours to days if sync repl; point in time if VSS
  Logs: 0 (must reseed passive); N/A if log not needed; same as DB if needed
  DB LUN dies: 0 with synchronous replication; point-in-time with VSS clones
  LOG LUN dies: Hours to days with VSS clones only
  Same as Server for Stretched CCR; 1 Log**; Hours to days with VSS clone
* Assumes following best practice guidance for Transport Dumpster
** Assumes replication is keeping up

Why CCR? Why not SCC?
Logical Corruption
  Corruptions caused by the application
  Logical corruption is replicated by all synchronous and asynchronous replication solutions
  SCR with lag replay can mitigate if detected early
Physical Corruption
  SCC: no mechanism to detect database corruption on the copy replicated by 3rd Party solutions (e.g., Backups)
  SCC: no mechanism to detect log corruption on the copy replicated by 3rd Party solutions (e.g., log inspection)
  With hardware-based replication, the deeper stack can lead to corruption caused by:
    HBA driver/firmware
    Multi-path driver
    Server hardware
    FC Switch firmware
    Storage controller firmware/OS
    Target storage controller firmware/OS

Continuous Replication Demystified

Basic Replication Pipeline
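[Diagram: the Store writes to the Source DB and the Source Log Directory; the Log Copier copies closed logs into the Inspector Directory; the Log Inspector moves verified logs into the Replica Log Directory; the Log Replayer replays them into the Replica DB]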
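The stages named on this slide can be pictured as a small hand-off chain. The following is a minimal Python sketch of that flow, not Exchange's implementation: the class `ReplicaPipeline`, the CRC-based inspection, and the dictionaries standing in for directories are all assumptions made for illustration.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class LogFile:
    generation: int
    payload: bytes
    checksum: int  # recorded when the source closes the log (assumed mechanism)


def close_log(generation: int, payload: bytes) -> LogFile:
    """Model the source closing a log file and recording a checksum for it."""
    return LogFile(generation, payload, zlib.crc32(payload))


@dataclass
class ReplicaPipeline:
    """Hypothetical model: copier -> inspector directory -> inspector -> replica logs -> replayer."""
    inspector_dir: dict = field(default_factory=dict)
    replica_log_dir: dict = field(default_factory=dict)
    replica_db: list = field(default_factory=list)
    last_replayed: int = 0

    def copy(self, log: LogFile) -> None:
        # Log Copier: pull a closed log from the source into the inspector directory.
        self.inspector_dir[log.generation] = log

    def inspect(self) -> list:
        # Log Inspector: verify each copied log; only intact logs move on.
        rejected = []
        for gen in sorted(self.inspector_dir):
            log = self.inspector_dir.pop(gen)
            if zlib.crc32(log.payload) == log.checksum:
                self.replica_log_dir[gen] = log
            else:
                rejected.append(gen)
        return rejected

    def replay(self) -> None:
        # Log Replayer: apply logs strictly in generation order, stopping at the first gap.
        while self.last_replayed + 1 in self.replica_log_dir:
            self.last_replayed += 1
            self.replica_db.append(self.replica_log_dir[self.last_replayed].payload)
```

In this model a damaged log is rejected at inspection, and replay halts just before the missing generation because later logs cannot be applied over a gap.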

Continuous Replication Basics
When the current log file is closed, it is copied to the replication target by the Replication service.
Replication service:
  at source: creates read-only shares for the log directory
  at target: reads from the shares and pulls a copy of the log file
  contains a ReplicaInstance for each storage group
Configuration is discovered from Active Directory (every 30 sec for LCR/CCR, every 3 min for SCR).

Continuous Replication Basics
Communication is done via logs, registry, cluster database and RPC.
  Logs: replicate database changes and backup status
  Registry: used in LCR and SCR; also in CCR for checkpointing the current log generation value for loss calculation
  Cluster database: cluster res "Exchange Information Store Instance (CMSName)" /priv | findstr /i replay
  RPCs: the target Replication service makes RPC calls into the Store for log truncation coordination

Lost Log Resilience (LLR)
Designed to minimize the need to reseed after a lossy failover.
Normally, database changes are written to the log file before the database, and the database can be updated as soon as the change is logged.
LLR modifies this behavior by delaying updates to the database until 1 or more log generations are created.
Utilizes a new log stream marker called the waypoint:
  Minimum log required to prevent database divergence
  No modifications after the waypoint have been written to the database

Transaction Markers
Initiating FILE DUMP mode...
Database: priv1.edb
...
State: Dirty Shutdown
Log Required: 2-10 (0x2-0xA)
Log Committed: 0-20 (0x0-0x14)
Committed: Log generation 20
Checkpoint: Log generation 2
Waypoint: Log generation 10
What this means:
  We only need logs 2-10
  Logs after the waypoint (11-20) can be discarded
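The relationship between the three markers can be worked through in a few lines. This is a sketch under the reading given on the slide (logs from the checkpoint through the waypoint are required, anything later is expendable); the function name `log_ranges` is hypothetical.

```python
def log_ranges(checkpoint: int, waypoint: int, committed: int):
    """Return (required, expendable) log generation ranges, both inclusive."""
    if not checkpoint <= waypoint <= committed:
        raise ValueError("expected checkpoint <= waypoint <= committed")
    required = (checkpoint, waypoint)
    # Nothing after the waypoint has been written to the database yet.
    expendable = (waypoint + 1, committed) if committed > waypoint else None
    return required, expendable


required, expendable = log_ranges(checkpoint=2, waypoint=10, committed=20)
print(required)    # (2, 10)
print(expendable)  # (11, 20)
```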

[Diagram: log streams on NodeA and NodeB with checkpoint and waypoint markers, stepping through a failover]
  Healthy CCR
  NodeA fails and a failover to NodeB occurs
  Validate database can mount: logs lost < AutoDatabaseMountDial
  Logs are generated on NodeB (beyond gen21)
  NodeA recovers and performs a divergence check
  NodeA performs incremental reseed and copies logs
  Healthy CCR
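The failover walkthrough above can be reduced to two checks: whether the new active copy may mount, and whether the old active database has diverged. The sketch below is a hypothetical model, not the Replication service's logic. The function name, the numeric `mount_dial_logs` parameter, and the example generation numbers are assumptions; the mount comparison follows the slide's "logs lost < AutoDatabaseMountDial" wording.

```python
def failover_outcome(last_generated: int, last_copied: int,
                     old_active_waypoint: int, mount_dial_logs: int) -> dict:
    """Hypothetical model of the mount check and divergence check after a failover."""
    logs_lost = last_generated - last_copied
    # Slide rule: the database may mount when logs lost < AutoDatabaseMountDial.
    can_mount = logs_lost < mount_dial_logs
    # LLR: nothing past the waypoint reached the old active database, so it has
    # only diverged if its waypoint lies beyond the last log the passive received.
    if logs_lost == 0:
        reseed = "none"
    elif old_active_waypoint <= last_copied:
        reseed = "incremental"
    else:
        reseed = "full"
    return {"logs_lost": logs_lost, "can_mount": can_mount, "reseed": reseed}


# Illustrative numbers only: generation 21 was the last log created on NodeA,
# NodeB had received up to 19, and NodeA's waypoint was at 17.
print(failover_outcome(21, 19, 17, mount_dial_logs=6))
# {'logs_lost': 2, 'can_mount': True, 'reseed': 'incremental'}
```

When the waypoint is ahead of the last copied log, the model reports a full reseed, which matches the "lost log past current waypoint" case.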

Maximum number of logs generated each day due to log roll activity
In the absence of user or database activity, ESE now also forces the active log file to close.
[15 (minutes) ÷ LLR Depth value] = Frequency of log roll activity (in minutes)
Maximum number of logs generated per day by an idle storage group, by Mailbox server configuration:
  Stand-alone (with or without LCR): 96
  SCC: 96
  CCR: 960
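The table values follow directly from the log roll formula. As a worked check, the sketch below assumes an LLR depth of 1 for stand-alone and SCC servers and 10 for CCR, values inferred from the 96 and 960 figures rather than stated on the slide; the function name is hypothetical.

```python
def idle_logs_per_day(llr_depth: int, base_minutes: float = 15.0) -> int:
    """Logs forced per day on an idle storage group: one roll every 15 / LLR depth minutes."""
    roll_interval_minutes = base_minutes / llr_depth
    return round(24 * 60 / roll_interval_minutes)


print(idle_logs_per_day(1))   # 96  (one log every 15 minutes)
print(idle_logs_per_day(10))  # 960 (one log every 1.5 minutes)
```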

When Do I Need A Full Reseed?
Rarely:
  Lost log past current waypoint
    Admin accepted large amount of loss by running Restore-StorageGroupCopy
    Automatic mount while LLR was “not honored”
    Automatic lossy mount with “stale” loss window calculation
  Log corruption prior to log replay
    ESE cannot skip over logs
  Database files modified outside of Store or Replication service
    E.g., offline defrag, eseutil /r

Transport Dumpster
Hub Transport servers retain messages that have been delivered to the destination mailbox until a size or time limit is reached.
Transport Dumpster is per storage group per Hub Transport server, for servers in the same Active Directory site as the storage group.
Transport Dumpster statistics: Get-StorageGroupCopyStatus -DumpsterStatistics
Output:
  DumpsterServersNotAvailable:{HUB1}
  DumpsterStatistics: {HUB2(2/25/ :20:37 PM; 2 ; 1032KB)}

Transport Dumpster
[Diagram: a CCR CMS with MBX1 (active) and MBX2 (passive), and Hub Transport servers HUB1 and HUB2, each holding per-storage-group dumpster contents for SG1 and SG2 (Msg1 through Msg4). After failover, MBX2 becomes active and sends "Redeliver SG1,SG2" requests to the hubs, which return retry, timeout or success. The "Resubmit Required" list for a storage group shrinks from HUB1,HUB2 to HUB1 to empty as requests succeed.]

Transport Dumpster
How much data loss can transport dumpster mitigate?
  18 MB dumpster per storage group on 8 Hub Transport servers = 144 MB / storage group
  [20 MB / 10 hour] x [100 users / SG] = 200 MB message traffic in one hour
  Putting the above two together gives 60 min x 144 / 200 = 43.2 minutes worth of data
  In 43.2 minutes, 144+ logs are created per SG
Customize transport dumpster size/time limit:
  Set-TransportConfig -MaxDumpsterSizePerStorageGroup 30MB -MaxDumpsterTime 07.00:00:00
No time window guarantees:
  If there are no message size limits, a single large message (e.g., 15 MB) will purge all other messages for destination storage group(s) on a given Hub Transport server
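The sizing arithmetic above is easy to rerun with other inputs. The sketch below reproduces the slide's figures; it assumes the 20 MB per 10 hours figure is per user and that each transaction log is 1 MB (the Exchange 2007 log file size). The function names are hypothetical.

```python
def dumpster_coverage_minutes(dumpster_mb_per_hub: float, hub_count: int,
                              mb_per_user_per_hour: float, users_per_sg: int) -> float:
    """Minutes of message traffic the combined dumpsters can hold for one storage group."""
    total_dumpster_mb = dumpster_mb_per_hub * hub_count          # 18 * 8 = 144 MB
    traffic_mb_per_hour = mb_per_user_per_hour * users_per_sg    # 2 * 100 = 200 MB/hour
    return 60 * total_dumpster_mb / traffic_mb_per_hour


def logs_covered(dumpster_mb_per_hub: float, hub_count: int, log_size_mb: float = 1.0) -> float:
    """Lower bound on logs generated while that much mail arrives (assumes 1 MB logs)."""
    return dumpster_mb_per_hub * hub_count / log_size_mb


minutes = dumpster_coverage_minutes(18, 8, mb_per_user_per_hour=20 / 10, users_per_sg=100)
print(round(minutes, 1))    # 43.2
print(logs_covered(18, 8))  # 144.0
```

Raising MaxDumpsterSizePerStorageGroup to 30 MB in the same model gives 240 MB per storage group, or 72 minutes of traffic at the same load.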

Transport Dumpster
When CCR detects a lossy failover:
  Expands loss window by 12 hours back and 1 hour forward
  Finds all Hub Transport servers in the local Active Directory site
  Requests transport dumpster redelivery from all detected servers
    New servers not added to redelivery list
    Inaccessible servers: CCR retries same request every 30 seconds until configured MaxDumpsterTime
  If multiple lossy failovers take place, the new loss window is added to the previous one
Restore-StorageGroupCopy on LCR is a one-time request, no retries
Redelivery not triggered as part of Setup /recoverCMS
No other ways to redeliver messages from transport dumpster
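The window expansion described above can be sketched as simple date arithmetic. This is an illustration only: the function name is hypothetical, the dates are made up, and treating "added to the previous one" as the union of the two windows is an assumption about how repeated failovers combine.

```python
from datetime import datetime, timedelta

LOOK_BACK = timedelta(hours=12)
LOOK_FORWARD = timedelta(hours=1)


def redelivery_window(loss_start: datetime, loss_end: datetime, previous=None):
    """Expand a loss window 12 hours back and 1 hour forward, merging any earlier window."""
    start, end = loss_start - LOOK_BACK, loss_end + LOOK_FORWARD
    if previous is not None:
        # Assumed behaviour: a later failover widens the earlier window.
        start, end = min(start, previous[0]), max(end, previous[1])
    return start, end


first = redelivery_window(datetime(2008, 6, 1, 14, 0), datetime(2008, 6, 1, 14, 5))
print(first[0], first[1])  # 2008-06-01 02:00:00 2008-06-01 15:05:00
```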

Redundant Networks
Use for log shipping and seeding in CCR:
  Enable-ContinuousReplicationHostName
  Seeding: Update-StorageGroupCopy -DataHostNames:Host1,Host2
  Get-ClusteredMailboxServerStatus reports:
    OperationalReplicationHostNames:
    FailedReplicationHostNames:
    InUseReplicationHostNames:
Watch out for a misconfigured hosts file

Circular Logging
One configuration setting with two consumers:
  Store service: requires database to be dismounted and re-mounted to take effect
  Replication service: picks up new setting dynamically
In CCR, it’s no big deal to switch between on/off/on
In some settings, logs are deleted prematurely
  Example: turn off circular logging, then enable LCR without dismount/mount of database
  ESE is still doing log truncation with circular logging logic
  Logs will get truncated before making it to the LCR copy
To be safe, follow this recipe: suspend, dismount, change setting, mount, resume

Troubleshooting Exchange Clusters and Continuous Replication

Troubleshooting Replication & Failover
  Get-StorageGroupCopyStatus
  Test-ReplicationHealth
  Cluster Log
  Get-ClusteredMailboxServerStatus
  Getscrsources.ps1
  Test-Mailflow
  Application Event Log – Replication events
    Get-EventLogLevel -id:"MSExchange Repl" | Set-EventLogLevel -Level expert
    Get-EventLogLevel -id:"MSExchange Cluster" | Set-EventLogLevel -Level expert
  System Event Log – Cluster events
  Active Directory management tools
  Network Monitor

Troubleshooting Get-StorageGroupCopyStatus
[Screenshot of Get-StorageGroupCopyStatus output, with callouts:]
  = LastLogCopyNotified - LastLogCopied
  Time stamp on source SG of most recent log
  Time of source's most recent log known to copy
  Time stamp on source SG of last successful log copy
  Must use -DumpsterStatistics option to get these values

Troubleshooting Test-ReplicationHealth
ClusterNetwork:
  Checks connectivity of all network interfaces
  Checks cluster group is up
  Warns in multi-subnet topologies, since not all cluster networks can be up at the same time
SGCopyQueueLength: warns at 3 and errors at 6
SGReplayQueueLength: warns at 30 and errors at 60
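The two queue-length checks apply the same rule with different thresholds. The sketch below encodes the slide's numbers; the function name is hypothetical, and treating the thresholds as inclusive (a queue of exactly 3 already warns) is an assumption.

```python
# (warning threshold, error threshold) per check, as listed on the slide.
THRESHOLDS = {
    "SGCopyQueueLength": (3, 6),
    "SGReplayQueueLength": (30, 60),
}


def classify(check: str, queue_length: int) -> str:
    """Map a queue length to Passed / Warning / Error for the named check."""
    warn_at, error_at = THRESHOLDS[check]
    if queue_length >= error_at:
        return "Error"
    if queue_length >= warn_at:
        return "Warning"
    return "Passed"


print(classify("SGCopyQueueLength", 4))     # Warning
print(classify("SGReplayQueueLength", 12))  # Passed
```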

Troubleshooting Cluster Log
Windows Server 2003
  %windir%\cluster\cluster.log
  Logs are always appended to this file
Windows Server 2008
  Must generate the cluster log file
  cluster.exe [[/CLUSTER:]cluster-name] LOG <options>
  <options> =
    /G[EN[ERATE]] [/COPY[:"directory"]] [/NODE:"node-name"] [/SPAN[MIN[UTE[S]]]:min]
    /SIZE:logsize-MB
    /LEVEL:logLevel
  If /COPY is not specified, the log is written to %windir%\Cluster\Reports\Cluster.log
  If /NODE is not specified, a log file is generated on every node
  /SIZE must be between 8 and 1024 MB
  /LEVEL must be between 0 and 10
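To make the syntax above concrete, here is a small Python sketch that assembles a `cluster.exe ... LOG /GEN` argument list and checks the stated ranges. It only builds the arguments and does not run anything. The function names and the example cluster, node and directory names are hypothetical, and only switches shown on the slide are used.

```python
def cluster_log_args(cluster_name=None, copy_dir=None, node=None, span_minutes=None):
    """Build the argument list for generating a cluster log on Windows Server 2008."""
    args = ["cluster.exe"]
    if cluster_name:
        args.append(f"/CLUSTER:{cluster_name}")
    args += ["LOG", "/GEN"]
    if copy_dir:
        args.append(f"/COPY:{copy_dir}")      # omitted: default Reports directory is used
    if node:
        args.append(f"/NODE:{node}")          # omitted: a log is generated on every node
    if span_minutes is not None:
        args.append(f"/SPAN:{span_minutes}")  # only the last N minutes of log data
    return args


def validate_log_settings(size_mb: int, level: int) -> None:
    """Check /SIZE and /LEVEL values against the ranges given on the slide."""
    if not 8 <= size_mb <= 1024:
        raise ValueError("/SIZE must be between 8 and 1024 MB")
    if not 0 <= level <= 10:
        raise ValueError("/LEVEL must be between 0 and 10")


print(cluster_log_args(cluster_name="EXCLUS1", node="NODEA", span_minutes=30))
# ['cluster.exe', '/CLUSTER:EXCLUS1', 'LOG', '/GEN', '/NODE:NODEA', '/SPAN:30']
```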

Troubleshooting Server Failover but Databases Didn’t Mount
Steps to troubleshoot:
  Run Get-StorageGroupCopyStatus
  Check the log directories on Active and Passive
  Run Restore-StorageGroupCopy and then Mount-Database

Troubleshooting Log File Corrupted
Steps to troubleshoot:
  Run Get-StorageGroupCopyStatus and/or Test-ReplicationHealth
  Reseed the passive copy / SCR target: run Suspend-StorageGroupCopy, then run Update-StorageGroupCopy on the passive node or SCR target

Troubleshooting SMB File Share for Replication Missing
Steps to troubleshoot (SCR/CCR):
  Run Test-ReplicationHealth on Passive
  Run Get-StorageGroupCopyStatus on Passive
  Run Get-ClusteredMailboxServerStatus
  Verify share on Active Node
  Stop sharing the file share; the Replication service recreates it in 30 seconds
  Run Test-ReplicationHealth on Active
  Check Application Event Log
  Check Active Directory permissions

Known Issues

Known Issues
Update Rollup 5 (RU5) for Exchange 2007 SP1 can cause Enable-StorageGroupCopy to fail in an SCR topology that consists of a parent and child domain structure:
  “Standby continuous replication is not supported between computers in different Active Directory domains. The target node is in domain <child domain> which is different from the source domain of <parent domain>”
Workarounds:
  Uninstall RU5, enable SCR, re-install RU5
  Use a Management Console running pre-RU5 code
Exchange12 bug. Expected fix in RU7 for Exchange 2007 SP1.

Known Issues
Network shares get deleted and created every 5 minutes by the Replication service on a Windows 2008 SCC when SCR is enabled:
  Replication service share names intermittently disappear from the cluster, causing replication status to repeatedly switch back and forth between failed and healthy states
  Test-ReplicationHealth on SCR target may succeed, showing all tests passed
  Get-StorageGroupCopyStatus on SCR target shows Healthy, Initializing or Failed
  No events on source, but SCR target will log ESE event 522 in the application event log
Exchange12 bug. Expected fix in RU7 for Exchange 2007 SP1.

Known Issues
When running VSS backup, ESE event 522 is logged on the passive node:
  Event is logged on resuming a suspended storage group
  Event log fills
  Event message details: Microsoft.Exchange.Cluster.ReplayService (7012) Log Verifier e0a : An attempt to open the device name "\\source\share$" containing "\\source\share$\" failed with system error 5 (0x ): "Access is denied. ". The operation will fail with error (0xfffffbf8).
Workaround:
  If Get-StorageGroupCopyStatus is healthy for storage groups, ignore the event
  If Test-ReplicationHealth passes all tests, ignore the event
Exchange12 bug. Expected fix in RU7 for Exchange 2007 SP1.

Known Issues
Reseed fails when you restore 1 full backup and then more than 2 differential backups:
  Restoring to the active node can succeed, but CCR no longer works after recovery
Workaround:
  Take a full backup when the restore is finished (note this may not be practical with large databases)
Exchange12 bug. Expected fix in RU8 for Exchange 2007 SP1.

Key Takeaways
  Exchange 2007 includes several Mailbox Server availability configurations
  CCR+SCR provide higher availability at a lower cost than any other solution
  There are a number of cmdlets and tools that can be used for troubleshooting and managing continuous replication
  LLR minimizes need for full reseeds
  Transport Dumpster redelivers all routed mail after failover
  CCR addresses all ranges of failures from disk to full site
  CCR on DAS provides great RTO and RPO

Thank You!

Pro-Exchange
Here this week! Visit our usergroup booth
Core members: Ilse Van Criekinge, Johan Delimon, Tonino Bruno
