As part of many of my vSphere-related professional services engagements, such as jumpstarts, designs, upgrades, and health checks, I typically address the alarms provided by VMware vCenter Server. Frequently, I recommend creating some custom alarms and configuring specific actions on some alarms to meet customer needs. Although my recommendations are unique for each customer, they tend to have many similarities. Here I am providing a sample of the recommendations that I gave to a customer in Los Angeles whose major focus is ensuring high availability. In this scenario, the customer does not use an SNMP management system, so we decided to send emails to the administration team instead of sending SNMP traps. Also, in this scenario, the customer planned to configure Storage DRS in Manual mode instead of Automatic mode.
vCenter Alarms and Email Notifications
Configure the Actions for the following pre-defined alarms to send email notifications. I consider each of these alarms to be unexpected and worthy of immediate attention if they trigger in this specific vSphere environment. Unless otherwise stated, configure the Action to occur only when the alarm changes to the Red state.
- Host connection and power state (alerts if host connection state = “not responding” and host power state is not “Standby”)
- Host battery status
- Host error
- Host hardware fan status
- Host hardware power status (HW Health tab indicates UpperCriticalThreshold = 675 Watts, UpperThresholdFatal = 702 Watts)
- Host hardware system board status
- Host hardware temperature status
- Host hardware voltage
- Status of other host hardware objects
- vSphere HA host status
- Cannot find vSphere HA master agent
- vSphere HA failover in progress
- vSphere HA virtual machine failover failed
- Insufficient vSphere HA failover resources
- Storage DRS Recommendation (if the decision is made to configure Storage DRS in Manual mode)
- Datastore cluster is out of space
- Datastore usage on disk (Red state is triggered at 85% usage)
- Cannot connect to storage (triggered if host loses connectivity to a storage device)
- Network uplink redundancy degraded
- Network uplink redundancy lost
- Health status monitoring (triggers if changes occur to overall vCenter Service status)
- Virtual Machine Consolidation Needed status (triggered if a Delete Snapshot task failed for a VM)
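To make the "Red state only" rule concrete, here is a minimal Python sketch of the policy, using the 85% red threshold noted for the datastore usage alarm in the list above. It only models the logic and does not call any vCenter API. The function names are hypothetical, and the 75% yellow threshold and the "at or above" comparison are assumptions that should be checked against the alarm definition in your own environment.

```python
# Hypothetical model of the notification policy; not a vCenter API.
RED_USAGE_PERCENT = 85.0  # red threshold taken from the list above


def datastore_usage_state(capacity_gb, free_gb, yellow_percent=75.0):
    """Return 'green', 'yellow', or 'red' based on used-space percentage.

    yellow_percent is an assumed value; only the red threshold comes
    from the recommendations above.
    """
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    used_percent = 100.0 * (capacity_gb - free_gb) / capacity_gb
    if used_percent >= RED_USAGE_PERCENT:
        return "red"
    if used_percent >= yellow_percent:
        return "yellow"
    return "green"


def should_send_email(previous_state, new_state):
    """Send email only when the alarm changes into the red state."""
    return new_state == "red" and previous_state != "red"
```

With this policy, a datastore that moves from yellow to red generates one email, while a datastore that stays red or drops back to yellow generates none.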
Consider creating these custom alarms on the folders where critical VMs reside. Optionally, define email actions on some of these.
- Datastore Disk Provisioned (%) (set the yellow trigger to 100%, the point at which the provisioned disk space meets or exceeds the capacity)
- VM Snapshot size (set to trigger at 2 GB)
- VM Max Total Disk Latency (set trigger at 20 ms for 1 minute)
- VM CPU Ready Time – assign these to individual VMs or folders, depending on the number of vCPUs (total virtual cores) assigned to each VM
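The reason the CPU Ready Time alarm depends on the vCPU count is that the VM-level ready value is a summation in milliseconds across all of the VM's vCPUs, so the same number of milliseconds means something different on a 1-vCPU VM than on a 4-vCPU VM. The Python sketch below shows one way to derive a per-VM trigger value. It assumes the 20-second sampling interval of the real-time performance charts and a 5% per-vCPU ceiling, which is a common rule of thumb rather than a value from this engagement. All function names are hypothetical.

```python
REALTIME_INTERVAL_S = 20  # assumed: real-time chart sampling interval in seconds


def cpu_ready_percent(ready_ms, interval_s=REALTIME_INTERVAL_S):
    """Convert a CPU ready summation (ms) into a percentage of the interval."""
    return 100.0 * ready_ms / (interval_s * 1000)


def cpu_ready_threshold_ms(num_vcpus, percent_per_vcpu=5.0,
                           interval_s=REALTIME_INTERVAL_S):
    """Return the ready-time trigger (ms) for a VM with num_vcpus vCPUs.

    percent_per_vcpu is an assumed rule-of-thumb ceiling; adjust it to
    whatever the customer considers acceptable.
    """
    if num_vcpus < 1:
        raise ValueError("num_vcpus must be at least 1")
    return interval_s * 1000 * num_vcpus * percent_per_vcpu / 100.0


def provisioned_percent(provisioned_gb, capacity_gb):
    """Provisioned space as a percentage of capacity (100 or more means overcommitted)."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return 100.0 * provisioned_gb / capacity_gb
```

Under these assumptions, a 1-vCPU VM would trigger at 1000 ms and a 4-vCPU VM at 4000 ms, which is why it is convenient to group VMs into folders by vCPU count and assign one alarm per folder.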