Why use Proactive HA? - ryanbirk.com

What is Proactive HA?

Proactive HA detects degraded hardware conditions on a host and lets you evacuate its VMs before the issue causes an outage. Failure happens at the most inopportune times. Hardware can run in a degraded state for minutes, hours, or even days, and when it eventually fails, the workloads have to be restarted by HA. If only vCenter or the administrator had known, the workloads could have been kept from failing!

Proactive HA can respond to different types of failures. Currently, it recognizes five categories of failure events:

Power Supply.
Memory.
Fan.
Storage.
Network.

In a typical server failure, the server just goes down and HA restarts the VMs. Proactive HA, however, lets you configure actions for events that MAY lead to VM downtime. For instance, let’s say a power supply has failed. Your server has redundant power supplies, so it is still up, but it now has a single point of failure and is in a degraded state. When this occurs, Proactive HA is triggered: the VMs on the degraded host are migrated to a healthy host in the cluster, and the degraded host is put into one of the “modes” described below.

How do we know the host is degraded? There are new items called Health Providers that come into play. The Health Providers as of this writing are Dell, Cisco and HP but I am sure that there will be more added in the future.

The health provider reads the sensor data from the server, analyzes the results, and sends the state of the host to vCenter Server. The states are Healthy, Moderate Degradation, Severe Degradation, and Unknown. The first three are also known as Green, Yellow, and Red! Providers differ by server vendor and may offer additional features or functionality compared to their competitors, so be aware of that. Once vCenter is in the loop and aware of the degraded host, DRS can act on the state of the hosts in the cluster. As with traditional DRS, it evaluates where VMs can go and migrates them to their new hosts.
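To make the states a bit more concrete, here is a minimal Python sketch of how they could be modeled. This is not VMware's API; the names HealthState, color_of and is_degraded are made up for illustration, and leaving Unknown without a color is my own assumption.

```python
from enum import Enum
from typing import Optional


class HealthState(Enum):
    """Host health states a provider reports to vCenter (hypothetical model)."""
    HEALTHY = "Healthy"
    MODERATE = "Moderate Degradation"
    SEVERE = "Severe Degradation"
    UNKNOWN = "Unknown"


# Only three of the four states have a color.
# Assumption: Unknown has no color and maps to None.
_COLORS = {
    HealthState.HEALTHY: "green",
    HealthState.MODERATE: "yellow",
    HealthState.SEVERE: "red",
}


def color_of(state: HealthState) -> Optional[str]:
    """Return the color for a state, or None when the state is Unknown."""
    return _COLORS.get(state)


def is_degraded(state: HealthState) -> bool:
    """True for the two states that would call for remediation."""
    return state in (HealthState.MODERATE, HealthState.SEVERE)
```

With this model, color_of(HealthState.MODERATE) gives "yellow", and is_degraded is only true for the yellow and red states.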

There are three options for partially failed hosts:

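Quarantine mode – Avoid placing new VMs on the host, and migrate existing VMs off only when that does not hurt the rest of the cluster.
Maintenance mode – Migrate all VMs off the host and place it in maintenance mode.
Mixed mode – For moderate failures, quarantine the host and keep VMs running where needed. For severe failures, migrate all VMs off and place the host in maintenance mode.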
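The way the three options combine with the reported health state can be sketched as a small decision function. This is only an illustration in Python, not how vCenter implements it; every name here (Health, Mode, Action, remediation_for) is hypothetical, and taking no action for Healthy or Unknown hosts is my assumption.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = 1
    MODERATE = 2
    SEVERE = 3
    UNKNOWN = 4


class Mode(Enum):
    """The remediation option configured on the cluster."""
    QUARANTINE = 1
    MAINTENANCE = 2
    MIXED = 3


class Action(Enum):
    """What happens to the degraded host."""
    NONE = 1
    QUARANTINE = 2
    MAINTENANCE = 3


def remediation_for(mode: Mode, health: Health) -> Action:
    """Pick the action for a host given the configured mode and its health."""
    # Assumption: healthy and unknown hosts are left alone.
    if health not in (Health.MODERATE, Health.SEVERE):
        return Action.NONE
    if mode is Mode.QUARANTINE:
        return Action.QUARANTINE
    if mode is Mode.MAINTENANCE:
        return Action.MAINTENANCE
    # Mixed mode: quarantine on moderate, maintenance on severe.
    if health is Health.MODERATE:
        return Action.QUARANTINE
    return Action.MAINTENANCE
```

So a moderately degraded host in a cluster set to mixed mode ends up quarantined, while the same host reporting a severe failure is evacuated and placed in maintenance mode.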

Let’s talk about quarantine mode first. In quarantine mode, VMs are migrated off the degraded host with vMotion only if both of the following hold:

There is no performance impact on any other VMs in the cluster.
None of the DRS rules are compromised.
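Those two conditions boil down to a simple filter. Here is a rough Python sketch, assuming something has already worked out the two answers for each VM; the MigrationCandidate class and its fields are invented for this example and are not part of any VMware API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MigrationCandidate:
    """A VM on the quarantined host (hypothetical model)."""
    vm_name: str
    impacts_other_vms: bool   # would moving it hurt other VMs' performance?
    violates_drs_rule: bool   # would moving it break a DRS rule?


def vms_to_migrate(candidates: List[MigrationCandidate]) -> List[str]:
    """Return the VMs that can leave the host under both quarantine conditions."""
    return [
        c.vm_name
        for c in candidates
        if not c.impacts_other_vms and not c.violates_drs_rule
    ]


candidates = [
    MigrationCandidate("web01", impacts_other_vms=False, violates_drs_rule=False),
    MigrationCandidate("db01", impacts_other_vms=True, violates_drs_rule=False),
    MigrationCandidate("app01", impacts_other_vms=False, violates_drs_rule=True),
]
movable = vms_to_migrate(candidates)  # only web01 passes both checks
```

Any VM that fails either check stays on the quarantined host, which is why a full evacuation is not guaranteed in this mode.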

Quarantine mode also makes sure that newly built VMs in the cluster are not placed on that host. Maintenance mode goes further: it evacuates the VMs entirely and also does not allow any new machines to be placed on the failed host.

Now that we’ve covered quarantine mode, let’s cover maintenance mode in a bit more detail. Maintenance mode evacuates all the VMs off the host. You might be familiar with this mode already, as it’s been around for a while and is often used for patching hosts. It does not allow any VMs to run on the host whatsoever.

With quarantine mode, a full evacuation is not guaranteed. Quarantine mode is the new middle ground: unlike a host in maintenance mode, an ESXi host in quarantine can and will be used to satisfy VM demand where needed.
