Exchange Split Brain … On Purpose?

February 26, 2012 By Eric Shanks

I was recently tasked with performing a company-wide disaster recovery test.  The test had the usual goals: a standard recovery time objective and recovery point objectives.  Unfortunately, the test had to be performed in the middle of a production day without affecting production.  Under normal circumstances we could assume that our production servers were disabled or destroyed in some manner, power up our DR servers, and continue the business.  During this test, however, we needed to make sure that both networks could run at the same time.

Network design

Both the production network and the DR network used NetApp filers and VMware ESXi servers.  VMware Site Recovery Manager (SRM) was used in conjunction with NetApp SnapMirror to keep the data in sync between the two sites.  During a DR test, SRM could fail the production servers over to the secondary site and put them in an isolated “bubble” network, so that a server couldn’t communicate with its clone in the production site.

The tricky part was the Exchange servers, which were not failed over with SRM.  The production site had two Exchange 2010 mailbox servers and the DR site had one mailbox server.  All of these servers were in a Database Availability Group (DAG) configuration.  The DAG was there so that mailboxes could be activated in the DR site if there was an Exchange outage in the primary site that fell short of a total disaster.

Split Brain

To achieve the objectives I was given, I severed the WAN connection to stop communication between the two sites and then did a normal SRM “test” failover.  This failed our servers over to the DR site.  It also shut down the domain controller in the DR site, cloned it, and then started the clone.  (This was done so that any changes we wanted to make on the network could be made and the clone could be destroyed when the test was over.  Without the clone, any changes in AD would replicate back to the production site when the WAN link was restored.)  Lastly, the Exchange server in DR was snapshotted so that when the DR test was over, we could undo all of the Exchange failover changes that we made.  All of these changes were made automatically by PowerCLI scripts called from SRM.
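The DR-side preparation described above could be scripted roughly as follows. This is a minimal PowerCLI sketch under stated assumptions, not the scripts that were actually called from SRM: the vCenter address, VM names, and snapshot name are hypothetical placeholders, and it assumes a PowerCLI session with sufficient rights on the DR vCenter.

```powershell
# Hypothetical names; replace with values from your own environment.
$vCenter   = 'dr-vcenter.example.local'
$dcName    = 'DR-DC01'
$cloneName = 'DR-DC01-TestClone'
$exchName  = 'DR-MBX01'
$snapName  = 'Pre-DR-Test'

Connect-VIServer -Server $vCenter

# Shut down the DR domain controller cleanly and wait until it is powered off.
$dc = Get-VM -Name $dcName
Shutdown-VMGuest -VM $dc -Confirm:$false
while ((Get-VM -Name $dcName).PowerState -ne 'PoweredOff') {
    Start-Sleep -Seconds 5
}

# Clone the powered-off DC onto the same host, then start the clone.
$clone = New-VM -Name $cloneName -VM $dc -VMHost $dc.VMHost
Start-VM -VM $clone

# Snapshot the DR Exchange server so the failover changes can be undone later.
New-Snapshot -VM (Get-VM -Name $exchName) -Name $snapName -Description 'State before DR test'
```

Leaving the original DC powered off for the duration of the test is what keeps test-time AD changes from replicating back once the WAN link returns.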

Once the other servers were up and running and the DC was cloned, we needed to get the Exchange databases mounted.  Depending on your Exchange setup, you would need to use one of the two methods listed below.

Exchange Failover Procedures

Check your DatacenterActivationMode from the Exchange Management Shell.  In our case DAC was off, so that is the procedure covered here.  Similar procedures could be followed for mailbox servers in DAC = On mode.

  1. DatacenterActivationMode = off

Run “Get-DatabaseAvailabilityGroup [DAGNAME] | FL” and confirm that DatacenterActivationMode is set to “Off” for the DAG.
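As a small sketch of that check, the following narrows the output to the relevant properties and warns if the DAG turns out to be in DAC mode. The DAG name is a hypothetical placeholder.

```powershell
# Hypothetical DAG name; replace with your own.
$dagName = 'DAG1'

$dag = Get-DatabaseAvailabilityGroup -Identity $dagName
$dag | Format-List Name, Servers, DatacenterActivationMode

# The steps in this post assume DAC is off.
if ($dag.DatacenterActivationMode -ne 'Off') {
    Write-Warning "DAC mode is '$($dag.DatacenterActivationMode)'; the DAC = Off procedure does not apply."
}
```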

Stop the Cluster service on the DR mail server.  You cannot fail over the DAG and force quorum while the Cluster service is running.

Once the service has stopped, you should run:  Net start clussvc /forcequorum
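The stop and forced start can be run back to back from an elevated PowerShell session on the DR mailbox server. This is only a sketch of the two steps above; it assumes the default Cluster service name, ClusSvc.

```powershell
# Stop the Cluster service; Stop-Service returns once the service has stopped.
Stop-Service -Name ClusSvc

# Confirm it is really stopped before forcing quorum.
if ((Get-Service -Name ClusSvc).Status -ne 'Stopped') {
    throw 'Cluster service did not stop; do not force quorum yet.'
}

# Start the service again and force this node to form quorum on its own.
net start clussvc /forcequorum

# Show the resulting service state.
Get-Service -Name ClusSvc
```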

Next, open the Failover Cluster Manager console.

You will want to expand the DAG and go to nodes:

Right-click one of the DAG members from the failed site and choose Evict.

Then repeat with all the remaining DAG Members from the failed site.
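If you would rather not click through the console, the same evictions can be sketched with the FailoverClusters PowerShell module. This is an alternative to the console steps above, not what was done in the test; the node names are hypothetical, and it assumes the module is installed on the DR mailbox server and that the session is elevated.

```powershell
Import-Module FailoverClusters

# Hypothetical names of the production-site DAG members to evict.
$failedNodes = 'PROD-MBX01', 'PROD-MBX02'

foreach ($node in $failedNodes) {
    # -Force suppresses the confirmation prompt.
    Remove-ClusterNode -Name $node -Force
}

# Only the DR node should remain listed.
Get-ClusterNode | Format-Table Name, State -AutoSize
```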

Open an Exchange Management Shell with elevated permissions (Run as administrator).

Run:  Cluster [DAGNAME] /quorum /nodemajority

This command switches the cluster to the Node Majority quorum model.  With the production members evicted, the failover server is the only remaining member of the DAG and holds quorum on its own.

Then run:

Move-ActiveMailboxDatabase -Identity [MailboxDatabasename] -ActivateOnServer [DRMailServerRoom] -MountDialOverride BestEffort -SkipActiveCopyChecks -SkipClientExperienceChecks -SkipLagChecks

This command activates and mounts the named database on the DR server.  Repeat it for each database that has a copy on this DAG member.
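With more than a handful of databases, the activation can be looped. The sketch below assumes every database with a copy on the DR server should be activated there; the server name is a hypothetical placeholder, and the switches are the same ones used in the command above.

```powershell
# Hypothetical DR mailbox server name; replace with your own.
$drServer = 'DR-MBX01'

# Same override switches as the single-database command above.
$moveParams = @{
    ActivateOnServer           = $drServer
    MountDialOverride          = 'BestEffort'
    SkipActiveCopyChecks       = $true
    SkipClientExperienceChecks = $true
    SkipLagChecks              = $true
    Confirm                    = $false
}

# One status entry is returned per database copy hosted on the DR server.
$copies = Get-MailboxDatabaseCopyStatus -Server $drServer

foreach ($copy in $copies) {
    # Skip databases that are already active and mounted here.
    if ($copy.Status -eq 'Mounted') { continue }
    Move-ActiveMailboxDatabase -Identity $copy.DatabaseName @moveParams
}
```

The skip switches bypass the normal health and lag checks, which is acceptable in a throwaway test bubble but deserves more thought in a real outage.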

Once the switchover is complete, run “Get-MailboxDatabaseCopyStatus” to make sure that the databases are mounted properly.

 At this point you should be able to see the databases mounted in the Exchange Management Console as well.
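A short sketch of that verification, which flags anything that did not mount. The server name is again a hypothetical placeholder.

```powershell
# Hypothetical DR mailbox server name; replace with your own.
$drServer = 'DR-MBX01'

$status = Get-MailboxDatabaseCopyStatus -Server $drServer
$status | Format-Table Name, Status, CopyQueueLength, ContentIndexState -AutoSize

# Report any copy that is not in the Mounted state.
$notMounted = @($status | Where-Object { $_.Status -ne 'Mounted' })
if ($notMounted.Count -gt 0) {
    Write-Warning ("Not mounted: " + (($notMounted | ForEach-Object { $_.Name }) -join ', '))
}
```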

Testing

Once the Exchange databases were mounted, we could test AD, the applications failed over using SRM, and, to an extent, email.  We could open Outlook, connect to the DR site mail server, and see our mail.  We could send mail out, but we couldn’t receive email from outside the DR network unless we used a different domain suffix.  (Remember that our primary MX record was still directing mail to the production site, which was still up and available.)

Cleanup

Once all of the testing was done, we could finish the SRM test, revert the Exchange server to its snapshot, delete the AD clone, and power on the original AD server.  Once all the servers were back to normal, we could re-establish the WAN link, and replication would start back up as normal.  No one would know the difference.
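The VM side of the cleanup could look something like the PowerCLI sketch below. It is a sketch under stated assumptions rather than the scripts actually used: all VM and snapshot names are hypothetical, it assumes an existing Connect-VIServer session to the DR vCenter, and the SRM test itself is still cleaned up from within SRM.

```powershell
# Hypothetical names; they must match whatever was used when preparing the test.
$dcName    = 'DR-DC01'
$cloneName = 'DR-DC01-TestClone'
$exchName  = 'DR-MBX01'
$snapName  = 'Pre-DR-Test'

# Revert the DR Exchange server to its pre-test snapshot.
$exch = Get-VM -Name $exchName
$snap = Get-Snapshot -VM $exch -Name $snapName
Set-VM -VM $exch -Snapshot $snap -Confirm:$false

# A snapshot taken without memory leaves the VM powered off after the revert.
if ((Get-VM -Name $exchName).PowerState -ne 'PoweredOn') {
    Start-VM -VM (Get-VM -Name $exchName)
}

# Power off and permanently delete the cloned domain controller.
$clone = Get-VM -Name $cloneName
Stop-VM -VM $clone -Confirm:$false
Remove-VM -VM $clone -DeletePermanently -Confirm:$false

# Bring the original DR domain controller back online.
Start-VM -VM (Get-VM -Name $dcName)
```

Run this before the WAN link is restored, so that neither the clone nor the failed-over Exchange state can reach production.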

Troubleshooting

These sites have the best resources for failing over Exchange:

http://technet.microsoft.com/en-us/library/dd351049.aspx

http://blogs.msexchange.org/walther/2011/10/09/real-life-exchange-2010-site-switchover-lesson-learned/