Automated failover from VMware to AWS in under 4 minutes

2018. I built an automation toolset for auto-failover from on-prem VMware to AWS. The goal was simple to state: minimize RTO and RPO for critical workloads. The implementation was anything but simple.

How it worked

The system monitored VMware health continuously: vCenter API checks, ESXi host health, datastore latency, VM heartbeat status. When it detected a failure pattern that crossed a set threshold, it triggered the failover sequence.
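A minimal sketch of that kind of threshold logic, not the original toolset. The field names, the latency and host-ratio limits, and the "three bad samples in a row" rule are all assumptions chosen for illustration. Collecting the samples from vCenter is left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSample:
    """One polling result. All fields are hypothetical names."""
    vcenter_reachable: bool
    esxi_hosts_healthy: int
    esxi_hosts_total: int
    datastore_latency_ms: float
    vm_heartbeats_ok: bool


def is_unhealthy(sample, max_latency_ms=50.0, min_host_ratio=0.5):
    """Return True if any signal is outside its (assumed) limit."""
    if not sample.vcenter_reachable or not sample.vm_heartbeats_ok:
        return True
    if sample.esxi_hosts_total == 0:
        return True
    if sample.esxi_hosts_healthy / sample.esxi_hosts_total < min_host_ratio:
        return True
    return sample.datastore_latency_ms > max_latency_ms


class FailureDetector:
    """Signals failover only after N consecutive unhealthy samples."""

    def __init__(self, consecutive_needed=3):
        self.recent = deque(maxlen=consecutive_needed)

    def observe(self, sample):
        """Record a sample and return True if failover should trigger."""
        self.recent.append(is_unhealthy(sample))
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(self.recent)
```

Requiring several bad samples in a row is one common way to keep a single slow poll from triggering a full failover. The right window size depends on the polling interval and how much detection time the RTO budget can spare.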

The sequence ran in three steps. First, launch instances from pre-staged AMIs in AWS. These weren't cold images. They were warm replicas, synced regularly, ready to boot with recent data. Second, reconfigure DNS to point at the AWS endpoints. Third, restore services in dependency order: database first, then app tier, then edge. Sub-4-minute failover for the full stack.
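The dependency ordering can be sketched with a topological sort. This is an illustration under assumptions, not the original code: the service names mirror the tiers above, and start_service is a hypothetical callback standing in for whatever actually boots a tier.

```python
from graphlib import TopologicalSorter

# Each service maps to the set of services it depends on.
DEPENDENCIES = {
    "database": set(),
    "app": {"database"},
    "edge": {"app"},
}


def restore_order(dependencies):
    """Return service names with every dependency before its dependents."""
    return list(TopologicalSorter(dependencies).static_order())


def restore_services(dependencies, start_service):
    """Start each service in dependency order; return the order used."""
    started = []
    for name in restore_order(dependencies):
        start_service(name)  # hypothetical: boot the tier and wait for health
        started.append(name)
    return started
```

Declaring dependencies as data rather than hardcoding the order means a new service only needs one new entry, and a cycle raises an error instead of silently starting things in the wrong order.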

The hard part

The hard part wasn't the automation. The hard part was testing it. You can't simulate a real data center failure in a staging environment. Staging environments don't have the same storage topology. They don't have the same network latency. They don't have the same failure modes.

So I built a chaos engineering framework. It would randomly kill VMs. Disconnect datastores. Partition networks. Simulate the kinds of failures that actually happen in production. And it would measure recovery. Time to detection. Time to failover initiation. Time to service restoration. Every number tracked.
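Those three measurements can be sketched as differences between four timestamps. The field names and the exact boundaries between phases are assumptions here, as is the 240-second target, which is just the sub-4-minute goal expressed in seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTimeline:
    """Timestamps in seconds for one chaos run. Names are hypothetical."""
    fault_injected_at: float
    fault_detected_at: float
    failover_started_at: float
    service_restored_at: float

    @property
    def time_to_detection(self):
        return self.fault_detected_at - self.fault_injected_at

    @property
    def time_to_failover_initiation(self):
        return self.failover_started_at - self.fault_detected_at

    @property
    def time_to_service_restoration(self):
        return self.service_restored_at - self.failover_started_at

    @property
    def total_recovery_time(self):
        # End-to-end figure: fault injection to service restored.
        return self.service_restored_at - self.fault_injected_at

    def meets_target(self, target_seconds=240.0):
        """True if the whole run finished inside the RTO target."""
        return self.total_recovery_time < target_seconds
```

Tracking the phases separately shows where the time goes. A run that misses the target because detection was slow needs a different fix than one where service restoration dragged.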

The reaction

The operations team hated it at first. Nobody wants a system that deliberately breaks things in their environment. "What if it triggers during business hours?" "What if the failover itself fails?" Valid concerns. So we started with scheduled windows. Off-hours only. Manual approval to arm the chaos tests.
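A guard like that can be sketched in a few lines. The window hours and the approved_by parameter are assumptions for illustration. The original schedule and approval mechanism aren't described here.

```python
from datetime import datetime, time


def in_off_hours(now, start=time(22, 0), end=time(5, 0)):
    """True if now falls inside the window, which may cross midnight."""
    current = now.time()
    if start <= end:
        return start <= current < end
    return current >= start or current < end


def may_run_chaos(now, approved_by):
    """Allow a chaos test only off-hours and only with a named approver."""
    return approved_by is not None and in_off_hours(now)
```

For example, may_run_chaos(datetime(2018, 6, 1, 23, 0), "ops-lead") passes, while the same call at 14:00, or with approved_by=None, is refused.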

Then they saw the RTO numbers. Sub-4 minutes, consistently. And they saw the bugs we caught. A DNS TTL that was too high. A service that had a hardcoded IP instead of a hostname. A database connection pool that didn't reconnect after failover. All of these would have been invisible until a real outage. The chaos tests found them first.
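The chaos tests found these at runtime. Two of them can also be caught earlier with static checks, sketched below as an illustration. The function names and the rule that a TTL should be at most a quarter of the RTO budget are assumptions, not something from the original toolset.

```python
import ipaddress
import re

# Matches anything shaped like a dotted quad; validated afterwards.
_IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_hardcoded_ips(config_text):
    """Return valid IPv4 literals found in a config file's text."""
    found = []
    for candidate in _IPV4_CANDIDATE.findall(config_text):
        try:
            ipaddress.ip_address(candidate)
        except ValueError:
            continue  # e.g. 999.1.1.1 is not a real address
        found.append(candidate)
    return found


def ttl_too_high(ttl_seconds, rto_budget_seconds=240, max_fraction=0.25):
    """True if cached DNS answers could outlive the assumed share of the RTO."""
    return ttl_seconds > rto_budget_seconds * max_fraction
```

The IP check will also flag four-part version strings such as 1.2.3.4, so treat its output as a list to review rather than a list of confirmed bugs. Neither check replaces an actual failover test. The connection pool that didn't reconnect, for instance, only shows up when the database really moves.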

The takeaway

A disaster recovery plan you've never tested is not a plan. It's a document. The only way to know your failover works is to fail over. Regularly. Automatically. And measure everything.