Exchange Split Brain … On Purpose?
February 26, 2012
I was recently tasked with performing a company-wide disaster recovery test. The test had the usual goals: a standard recovery time objective and recovery point objectives. Unfortunately, the test needed to be performed in the middle of a production day without affecting production. Under normal circumstances we could assume that our production servers were disabled or destroyed in some manner, power up our DR servers and continue the business. During this test, however, we needed to make sure that both networks could run at the same time.
Network design
Both the Production network and the DR network were using NetApp filers and VMware ESXi servers. VMware Site Recovery Manager (SRM) was used in conjunction with NetApp SnapMirror to keep the data in sync between the two sites. During a DR test, SRM could be used to fail the production servers over to the secondary site and put them in an isolated “bubble” network, so that a server couldn’t communicate with its clone in the production site.
The trick came with the Exchange servers, which were not failed over with SRM. The production site had two Exchange 2010 mailbox servers and the DR site had one mailbox server. All of these servers were members of one Database Availability Group (DAG). The DAG was used so that mailboxes could be activated in the DR site if there was an Exchange outage in the primary site that did not amount to a total disaster.
Split Brain
To achieve the objectives I was given, I severed the WAN connection to stop communication between the two sites and then did a normal SRM “test” failover. This failed our servers over to the DR site. It also shut down the Domain Controller in the DR site, cloned it and then started the clone. (This was done so that any changes we wanted to make on the network could be made and the clone could be destroyed when the test was over. If this wasn’t done, any changes in AD would replicate back to the Production site when the WAN link was restored.) Lastly, the Exchange server in DR was snapshotted so that when the DR test was over, we could undo all of the Exchange failover changes that we made. All of these changes were done automatically by PowerCLI scripts called from SRM.
Once the other servers were up and running and the DC was cloned, we needed to get the Exchange databases mounted. Depending on your Exchange setup, you would need to use one of two methods; the one we used is listed below.
Exchange Failover Procedures
Check your DatacenterActivationMode from the Exchange Management Shell. In our case DAC was off. Similar procedures could be followed for Mailbox servers in DAC = On mode.
- DatacenterActivationMode = off
Run: Get-DatabaseAvailabilityGroup [DAGNAME] | FL
Confirm that DatacenterActivationMode is set to “Off” for the DAG.
- Stop the Cluster service on the DR mail server. You cannot fail over the DAG and force quorum while the Cluster service is running.
Once the service has stopped, you should run: Net start clussvc /forcequorum
Next, open the Failover Cluster Manager console.
Expand the DAG and go to Nodes.
Right-click one of the DAG members from the failed site and choose Evict.
Then repeat for all the remaining DAG members from the failed site.
Open an Exchange Management Shell under elevated permissions.
Run: Cluster [DAGNAME] /quorum /nodemajority
This command will make the failover server the only member of the DAG and force it to have quorum for the DAG.
Then run:
Move-ActiveMailboxDatabase -Identity [MailboxDatabasename] -ActivateOnServer [DRMailServerRoom] -MountDialOverride BestEffort -SkipActiveCopyChecks -SkipClientExperienceChecks -SkipLagChecks
This command activates and mounts the named database on the DR server. Because -Identity names a single database, repeat it for each database that has a copy on this DAG member.
Once the switchover is complete, run Get-MailboxDatabaseCopyStatus to make sure that the databases are mounted properly.
At this point you should be able to see the databases mounted in the Exchange Management Console as well.
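To see why the forced quorum and the evictions are needed, it can help to work through the vote arithmetic. The sketch below is a simplified model, not anything Exchange or the cluster service runs: it assumes a plain Node Majority quorum with one vote per DAG member and no witness, and the function names are made up for illustration.

```python
def votes_needed(total_voters: int) -> int:
    """Smallest number of votes that is a strict majority of total_voters."""
    return total_voters // 2 + 1


def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """True if the reachable voters form a majority (simplified model)."""
    return reachable_voters >= votes_needed(total_voters)


# Scenario from this post: two production mailbox servers plus one in DR.
TOTAL_DAG_MEMBERS = 3

# WAN link severed: the DR server can only see itself (1 of 3 votes).
dr_before = has_quorum(1, TOTAL_DAG_MEMBERS)

# The two production servers still see each other (2 of 3 votes).
production = has_quorum(2, TOTAL_DAG_MEMBERS)

# After the two production members are evicted from the DR copy of the
# cluster, the DR server is the only voter left (1 of 1 votes).
dr_after = has_quorum(1, 1)

print(dr_before, production, dr_after)
```

Under these assumptions the DR server cannot form quorum on its own, which is why the cluster service has to be started with /forcequorum, while production keeps quorum and carries on. Once the DR side is cut down to a single member, both sides believe they own the DAG: the deliberate split brain in the title.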
Testing
Once the Exchange mailbox databases were mounted, we could test AD, the applications failed over using SRM and, to an extent, email. We could open Outlook, connect to the DR site mail server and see our mail. We could send mail, but we couldn’t receive email from outside the DR network unless we used a different domain suffix. (Remember that our primary MX record was still directing mail to the production site, which was still up and available.)
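The inbound mail limitation follows from how sending servers pick a destination: they try MX records in ascending preference order and only move on when the preferred host cannot be reached. The sketch below illustrates that rule. The host names, the preference values and the existence of a secondary MX record pointing at DR are all assumptions made for the example, not details of our environment.

```python
from typing import NamedTuple, Optional


class MxRecord(NamedTuple):
    preference: int  # lower number = tried first
    host: str


def pick_target(records: list[MxRecord], reachable: set[str]) -> Optional[str]:
    """Return the first reachable host in ascending preference order."""
    for record in sorted(records, key=lambda r: r.preference):
        if record.host in reachable:
            return record.host
    return None


# Hypothetical records: production is preferred, DR is the fallback.
records = [
    MxRecord(20, "mail.dr.example.com"),
    MxRecord(10, "mail.prod.example.com"),
]

# During the test production is still up, so it receives all inbound mail.
during_test = pick_target(records, {"mail.prod.example.com", "mail.dr.example.com"})

# In a real disaster production is gone and senders fall back to DR.
real_disaster = pick_target(records, {"mail.dr.example.com"})

print(during_test, real_disaster)
```

As long as production answers, outside senders never reach the DR server for the shared domain, which matches what we saw during the test.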
Cleanup
Once all of the testing was done, we could finish the SRM test, revert the Exchange server to its snapshot, delete the AD clone and power on the original AD server. Once all the servers were back to normal, we could re-establish the WAN link, and replication and everything else would start back up as normal. No one would know the difference.
Troubleshooting
Sites with the best resources for failing over Exchange.
Interesting.
Why would you want to “mix” the two DR methodologies? I would think that you either use a DAG to handle DR for Exchange or you use SRM. If you are using SRM for Exchange DR, then you would NOT use a DAG between the two sites at all, and the replicated Exchange system would simply be another protected VM.
Can you help me understand the downsides of doing it that way, instead of using the methods that you have chosen?
Sure.
It would be much simpler to fail over the Exchange system with SRM. I would recommend doing this unless the RTO is very short.
In my case we were using a DAG so that we could immediately fail over the mail server without waiting for VMs to power on and so forth. This is partly because, in our test, SRM would also have to wait for FlexClones on the storage system to be mounted by vSphere. Also, if you lost your Exchange systems in the primary site but didn’t declare a disaster and fail over your datacenter, you could still mount the Exchange store in the DR site.
I’m not sure how likely this is, but it is a possibility.
Again, the most straightforward way to handle the Exchange failover is to use SRM and keep your life simple.
Thnx
I guess there are a couple of concerns when using SRM to protect Exchange:
1) RTO, as you said: using a DAG would provide faster recovery for mail clients, and
2) Database RPO: a DAG would provide a transactionally consistent DB, but SRM (by leveraging vmwtools) can achieve only crash consistency for the DBs.
I am not an Exchange expert – so I do not really know what the impact is of recovering from a crash consistent DB – are there any special concerns, or is Exchange resilient enough that some transactions may be lost, but it will recover the DB just fine?
Peter
In my case, the DR site store would be the only store left if there was a disaster, so I don’t see an issue with SRM handling the failovers. What happens if you have a third site that is a DAG member too? I’m not sure. I would think that the third site would need to be activated and mounted first; the Exchange server that was “SRMed” would then need to become passive and have the DAG replicate the rest of the changes before it was mounted.
I will tell you that it is possible to mount the Exchange database if you accept a certain amount of data loss.
I just wish VMWtools was able to do log management, so that a snapshot would be fully transactionally consistent – that would be really great.
As for a 3rd DAG member – I have no idea, as I said, not being an Exchange dude exactly.
But thank you for sharing your experience on this – I take it you have actually had to mount a “crash consistent” Exchange DB and that it worked fine? I wonder if that has generally been the experience amongst Exchange admins – or, is depending on a “crash consistent” backup just a roll-of-the-dice kind of strategy?
Please join me in requesting that VMware include Exchange and SQL log management in VMWtools, so that we don’t have to guess and hope on this stuff!