VMware SRM Gotchas

I recently presented my current employer’s DR strategy at the Chicago VMUG and received several comments about the gotchas section, so I thought I’d get them on the blog for future reference.

During our DR Test we found several items that need to be carefully considered when doing a failover to a secondary site. It is my hope that this post provides a good starting point for considering your own DR Strategy using VMware Site Recovery Manager.

Defrags - If you defrag a virtual disk that is being replicated to your DR site, you may gain some performance on your production VM, but you’ll also be replicating almost all of its blocks to the DR site over again. This can eat up your bandwidth quickly, especially if you’re defragmenting multiple protected VMs.
Database Maintenance - Along with defrags, you may have a very similar situation with databases. Re-indexing a database can change a tremendous number of blocks, causing a lot of additional replication to take place. Instead of replicating your database with block-level replication, consider log shipping or database mirroring.
Snapshots - If you’re keeping snapshots on your replicated datastore, those snapshots are being replicated to the DR site as well. Maybe this is intended, but be careful to keep track of the size and number of snapshots you’re keeping.
Database File Replication - Make sure to replicate your database files at the same time. You don’t want to get into a situation where you’re replicating log files and the database at different times, because the copies at the DR site may not be consistent with each other. You can get away with replicating these volumes at different times if you know what you’re doing, but be careful. It’s much simpler to make sure the LUNs are syncing at the same time.
Run Books - SRM makes it very easy to fail over to a secondary site by simply pushing a button, but a run book is still a good idea. Imagine a situation where the VMware guy isn’t able to make it to the office due to a disaster, yet the company needs to fail over the servers. If there is a run book with a few simple instructions, almost anyone could perform a failover.
DR Site Templates - If your production site is a “smoking hole” and you’ve failed over successfully, you might still need to build a new VM. If you didn’t replicate your templates to the DR site, you might be building that server from scratch.
Asymmetric Bandwidth - Take care if you are using an asymmetric connection like DSL, where the speed differs in each direction. This might be a fairly cheap and useful way to replicate data to your DR site, but if a disaster occurs you might not have enough bandwidth in the opposite direction to get your data back to your new production site.
Storage DRS - VMware has a nice feature to automatically distribute I/O over different datastores to improve performance, but don’t forget that if a VM is moved from a replicated volume to a non-replicated volume, it might break your recovery plan.
Phone Tree - How do your users know that a disaster was declared? Maybe the disaster happened after hours and users need to be notified about a new method of logging in to work the next day. Consider a phone tree and an offsite website that can quickly be updated to give users instructions for logging in. This will save you time on your helpdesk that could be spent putting out fires.
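
The bandwidth concerns above (defrags, re-indexing, and asymmetric links) all come down to the same arithmetic: how long does it take to push a given amount of changed data over a given link? Here is a rough sketch of that estimate in Python. The function name, the 80% link efficiency figure, and the example sizes and speeds are all assumptions for illustration, not numbers from our environment or anything SRM reports.

```python
def transfer_hours(changed_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate hours needed to replicate changed_gb over a link of link_mbps.

    changed_gb  -- amount of changed data in gigabytes (decimal, 1 GB = 1000 MB)
    link_mbps   -- nominal link speed in megabits per second
    efficiency  -- assumed fraction of the link usable for replication (hypothetical 0.8)
    """
    if changed_gb < 0:
        raise ValueError("changed_gb must not be negative")
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")

    megabits = changed_gb * 1000 * 8          # GB -> MB -> megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical example: a defrag dirties 500 GB on a protected VM.
    # The link is assumed to be 50 Mbps toward the DR site but only
    # 10 Mbps in the reverse direction (an asymmetric connection).
    to_dr = transfer_hours(500, 50)
    back_to_prod = transfer_hours(500, 10)
    print(f"Replicating to DR site:   {to_dr:.1f} hours")
    print(f"Copying back to new prod: {back_to_prod:.1f} hours")
```

Running the same number through both directions of an asymmetric link makes the failback problem obvious before you are in the middle of a disaster. It is only a ballpark figure, since it ignores compression, deduplication, and other traffic sharing the link.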