I have had this conversation with many customers over the years: vROps metrics are only as good as the metrics collected and the logic vROps applies to them.
Until VMware Tools 10.3, vROps 6.7 relied heavily on the following two memory metrics – consumed and active.
The nature of the memory consumed metric is such that when a VM requests memory from the hypervisor, the value of this metric increases. However, it is difficult for the hypervisor to know when to free host physical memory upon virtual machine memory deallocation, because the guest operating system's free list is generally not publicly accessible. So in practice, the value of the memory consumed metric very often reaches close to 100%, and it does not decrease even when the VM later becomes idle.
This happens because some operating systems tend to assign the entire amount of memory they see in the "machine" to various kinds of caches to get the best possible performance. Some database and similar workloads tend to grab as much memory as they can, so they can use large buffers and do most of their work in memory.
Active memory, by contrast, is an estimate made by the VMkernel based on recently touched memory pages. The purpose of this metric is to ensure that, in the case of memory overcommitment, the hypervisor backs active guest memory with host memory as much as possible.
Memory overcommitment allows the hypervisor to use memory reclamation techniques to take inactive or unused host physical memory away from idle virtual machines and give it to other virtual machines that will actively use it. Without a counter like active, memory overcommitment would not be possible, because the hypervisor cannot reclaim host physical memory upon virtual machine memory deallocation.
However, the value of this metric differs from what the guest operating system reports, because it is an estimate based on whether pages are touched within the sampling interval. If the guest OS has acquired memory but is not actively touching it, active may show a much lower value than what you see inside the guest. Only by looking inside the guest is it possible to understand how much of its memory is currently used.
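To make the guest-side view concrete, here is a minimal sketch that computes a guest's own notion of "needed" memory from Linux /proc/meminfo-style data. The sample values are made up for illustration, and this calculation is only an approximation of what an in-guest agent like VMware Tools reports, not the actual memNeeded formula:

```python
# Sketch: compute guest-reported memory usage from /proc/meminfo-style data.
# The sample text below is invented; on a real Linux guest you would read
# the file /proc/meminfo instead.
SAMPLE_MEMINFO = """\
MemTotal:        8167848 kB
MemFree:          302564 kB
MemAvailable:    5312988 kB
Buffers:          184128 kB
Cached:          4187092 kB
"""

def parse_meminfo(text):
    """Parse 'Key: value kB' lines into a dict of integers (kB)."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields[key.strip()] = int(rest.split()[0])
    return fields

def guest_used_kb(fields):
    """Approximate memory the guest actually needs, in kB.

    MemAvailable already discounts page cache and buffers that the kernel
    can free on demand, so MemTotal - MemAvailable is much closer to
    "needed" memory than MemTotal - MemFree, which counts caches as used.
    """
    return fields["MemTotal"] - fields["MemAvailable"]

fields = parse_meminfo(SAMPLE_MEMINFO)
print(guest_used_kb(fields))                    # 2854860 kB actually needed
print(fields["MemTotal"] - fields["MemFree"])   # 7865284 kB "used" incl. caches
```

The gap between the two printed numbers is exactly why consumed stays near 100% while the guest is mostly running caches, and why a guest-aware metric gives a more honest picture.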
Therefore, in the 6.7 release, the implementation of the Memory|Usage% metric was changed to rely on a new metric, Memory|Utilization (KB), which considers usage from the guest OS perspective rather than the hypervisor perspective.
This new Memory|Utilization (KB) metric was implemented for the new capacity engine. It relied on the guest OS memNeeded metric; however, if guest OS data was not available, it fell back to the hypervisor's memory consumed metric.
This caused a behavioral change in Memory|Usage% when guest data was not available: it started to show the value of the memory consumed metric, while in previous versions Memory|Usage% used the hypervisor's active memory metric.
From the vROps side, we are going to change the implementation of the memory usage metric:
1. Memory|Usage(%) – calculated based on the guest metrics (Guest|Needed Memory(KB)), with a fallback to the vCenter-provided mem.usage.average (which is based on active memory) if guest metrics are not available
a. Guest metrics are used if available. Otherwise, we restore the pre-6.7 behavior
However, the existing formula of Memory|Utilization (KB) is kept as it is:
2. Memory|Utilization(KB) – leave this metric as is, i.e., calculated based on the guest metrics, with a fallback to Memory|Consumed(KB) if guest metrics are not available. Capacity planning, WLP, cost calculation, and What-If are all based on this metric.
Using Memory|Consumed for those engines is quite conservative, but at least it is safe, unlike using active memory or demand (which is also based on active memory), which can potentially lead to wrong recommendations.
a. So when guest metrics are not available, the virtual machine's capacity-related memory metrics will be calculated based on the consumed metric, which makes the behavior very conservative. This is much safer from the Capacity Planning, WLP, and What-If perspective when deciding whether a VM can fit in the target cluster. Rightsizing will not be as aggressive as before and may not recommend downsizing the VM.
b. Since the Memory|Workload % metric is calculated based on Memory|Utilization, it may show much higher values than Memory|Usage%. So if some management packs have defined alerts on the Memory|Workload% metric, those alerts may remain triggered.
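The two fallback rules above can be sketched as follows. This is a minimal illustration of the described behavior; the function names and signatures are mine, not vROps internals:

```python
from typing import Optional

def memory_usage_pct(guest_needed_kb: Optional[float],
                     vm_mem_kb: float,
                     active_usage_pct: float) -> float:
    """Memory|Usage(%): guest metrics if available, otherwise fall back
    to the vCenter-provided mem.usage.average (based on active memory),
    restoring the pre-6.7 behavior."""
    if guest_needed_kb is not None:
        return 100.0 * guest_needed_kb / vm_mem_kb
    return active_usage_pct

def memory_utilization_kb(guest_needed_kb: Optional[float],
                          consumed_kb: float) -> float:
    """Memory|Utilization(KB): guest metrics if available, otherwise fall
    back to Memory|Consumed(KB) - conservative, but safe for capacity
    planning, WLP, cost calculation, and What-If."""
    if guest_needed_kb is not None:
        return guest_needed_kb
    return consumed_kb

# Guest data available: both metrics reflect the guest's real need.
print(memory_usage_pct(2_097_152, 4_194_304, 65.0))   # 50.0
# Guest data missing: Usage% falls back to active, Utilization to consumed.
print(memory_usage_pct(None, 4_194_304, 65.0))        # 65.0
print(memory_utilization_kb(None, 3_900_000))         # 3900000
```

Note how the two metrics diverge when guest data is missing: Usage% takes the lower, active-based value, while Utilization takes the higher, consumed-based one, which is exactly why Memory|Workload% can read much higher than Memory|Usage% in that case.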
I have to thank Hovhannes Antonyan and Sunny Dua for providing much of the context for this blog article.