I have been hearing from customers about our broad operations portfolio, which includes vRealize Operations Manager (vROps), Wavefront, and CloudHealth.
These products have a lot of overlapping features, but together they provide a holistic solution. So why do we need three products to produce a comprehensive solution for Operations?
I could use a couple of buzzwords here, "persona," "focus lens," etc., but let's state it simply: Cloud Admin teams, application teams, and C-level teams need to view and consume data differently.
vROps is the tool every Cloud Admin needs, providing deep analytics, infrastructure costing, capacity planning, alerting, and performance troubleshooting for your private cloud. You can use vROps for some application data, but that data is driven from an infrastructure point of view. The vROps lens is infrastructure-focused, looking up at the application.
So how do I gather data and provide analytics for my applications that are part of my multi-cloud initiative?
Wavefront provides cross-cloud analytics, gathering data that can be analyzed and correlated every second to help troubleshoot performance gaps in applications. Sounds great! So why do I need vROps?
Wavefront gathers data at the application level and looks down the stack at the infrastructure. That view into the infrastructure, with data derived from an application point of view, will not tell you the full story of what is happening in the infrastructure.
What about costs for multiple clouds with a full cost structure for C-Level Teams?
CloudHealth can deliver full cost insight into multiple public clouds, like AWS, Azure, GCP, and VMC, as well as private clouds. CloudHealth includes optimization and rightsizing to help trim spending waste in public and private clouds alike. The costing data is comprehensive, with spend tracking and budget reporting.
So why would you still need to do costing with vROps?
CloudHealth focuses on costing metrics at the C-level/accounting level, while vROps provides more of a department-level showback/chargeback model for the private cloud and a few public clouds like AWS and Azure.
So three key takeaways are:
vROps –> Infrastructure Operations/Analytics and Costing
Wavefront –> Application Analytics for multi-cloud
CloudHealth –> Multi-cloud C-Level costing
vRealize Operations Manager (vROps) helps customers identify and remediate performance issues and manage alerts, capacity, and cost.
Please see the great work the community is doing to create a vROps dashboard repository for customers. These dashboards are created by some of the experts in the field, so go check them out!
When the book arrived, I dove in head first with the same high expectations I had for the VMware vSphere 4.1 HA and DRS Deep Dive and vSphere 5.1 Clustering Deep Dive books. The book is well written, with excellent diagrams as expected. There is a good deal of review throughout, making this a faster read for those who have read the other clustering deep dive books; that review is necessary so everyone is on the same page. That said, there are many changes in both vSphere 6.5 and 6.7, and this book covers them in a magnificent manner. I am going to go over a couple of favorite new features and a couple of items I learned along the way.
The vSphere 6.7 DRS algorithm no longer uses the snapshotting method its predecessors used. Older versions of vSphere would take a snapshot of host utilization and compare hosts for initial placement. This could cause an issue when multiple VMs were being placed at once: they would all land on the host with the most resources before the next snapshot of resources could be taken. vSphere 6.7 keeps continuously updated performance metrics, allowing VMs to be placed faster and distributed more evenly.
Over the last couple of years, I have had many conversations with a wide array of customers about DRS. The general consensus was that at 10 GbE the network cost of a vMotion was minimal, but I never had a definitive answer on the other costs, like CPU. The cost is a 30% reservation on a pCPU core for 1 GbE and a 100% reservation for 10 GbE, on both the source and destination hosts. If you are using encrypted vMotion, the costs are even higher.
Conversations about vMotion and CPU usage have centered on CPU %, not on the reservation of a CPU core. With the SDDC adding CPU overhead through vSAN and NSX, you should pay attention to metrics like CPU RDY% and CO-STOP%, set DRS to aggressive (level 5) for VM happiness, and keep hosts within 5% utilization of each other. Avoid the additional vMotions that will happen if you turn on advanced options, unless you have a business requirement to do so.
One final note: per KB 2108824, at 40 GbE three cores run with a 100% reservation on them. Yes, at 40 GbE a vMotion operation should take only a few seconds.
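To make those numbers concrete, here is a minimal sketch of the per-host CPU reservation; the table simply encodes the figures quoted above (30% of a core at 1 GbE, a full core at 10 GbE, three cores at 40 GbE) and is an illustration, not an official VMware formula:

```python
# Per-host vMotion CPU reservation by NIC speed, using the figures
# quoted in this post. The reservation applies on BOTH the source
# and destination host during the operation.
VMOTION_CORE_RESERVATION = {
    1: 0.3,    # 1 GbE -> 30% of one pCPU core
    10: 1.0,   # 10 GbE -> 100% of one pCPU core
    40: 3.0,   # 40 GbE -> three cores at 100% (per KB 2108824)
}

def vmotion_reservation(nic_gbe: int) -> float:
    """Return the pCPU core reservation per host for a vMotion at this NIC speed."""
    try:
        return VMOTION_CORE_RESERVATION[nic_gbe]
    except KeyError:
        raise ValueError(f"no quoted figure for {nic_gbe} GbE")

print(vmotion_reservation(10))  # 1.0 core reserved on source and destination
```

Remember that encrypted vMotion raises these costs further, so budget headroom accordingly.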
With End of General Support for vSphere 5.5 approaching quickly, here is a quick note that might help get upgrades moving faster: per-VM memory overhead can be reduced by up to 8x when moving to 6.x.
The stretched clusters section is an impressive architectural review of stretched clusters from a vSphere clustering standpoint. I want everyone to read this section more than once. Not only did I learn something about stretched clusters, but it also shows how something as seemingly simple as vSphere clustering has VCDX defense written all over it.
In closing, it was worth the time and effort to read this book. Frank, Duncan, and Niels did a great job.
I started testing the vROps CLI created by Blue Medora, with the use case of gathering information for documentation and the ability to do things programmatically in the future.
The first thing you will have to do is download the correct file:
Copy the file to a folder location like vROpsCLI; there should be two files, the CLI executable and examples.vropscli.yml. You will need to edit examples.vropscli.yml to point to your vROps instance and supply local credentials with access. Here is an example of the examples.vropscli.yml file:
Now you can kick off vropscli. Here is a list of the different commands:
As you can see, there is a lot of flexibility, from "get" commands for gathering information to commands for creating alerts or updating vROps. I will focus on a few of the "get" commands for quick documentation.
./vropscli getVropsLicense

You can pipe each command's output to a .csv file for readability:

./vropscli getVropsLicense > vROps_LicenseKey.csv
./vropscli getAdapters lists the adapters for this vROps instance, allowing us to document them and then use the UUIDs with other commands.
./vropscli getAdapterConfig --adapterId fce7e1fb-c374-4fb5-933c-158d1639b17a
A couple of items here I found interesting: the Cloud_TYPE field is set to Privite_Cloud, which tells me future versions will support different cloud types, most likely around the costing functionality that is new to vROps 6.7. VM_LIMIT is set to 2 million by default; I find that interesting and would love to hear from anyone pushing that limit.
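As a sketch of how these "get" commands could feed a documentation script, here is a small Python helper. The sample payload and field names below are assumptions for illustration; the actual output shape of `./vropscli getAdapters` may differ in your version, so adjust the key names to match what your instance returns:

```python
import json

# Hypothetical sample of what "./vropscli getAdapters" might emit;
# real field names may differ by vropscli version.
sample_output = '''
[
  {"name": "vCenter Adapter",
   "id": "fce7e1fb-c374-4fb5-933c-158d1639b17a"}
]
'''

def adapter_ids(raw_json: str) -> list:
    """Extract adapter UUIDs so each can be passed to getAdapterConfig."""
    return [entry["id"] for entry in json.loads(raw_json)]

for uuid in adapter_ids(sample_output):
    # Each UUID could then be fed back into the CLI, e.g.:
    #   ./vropscli getAdapterConfig --adapterId <uuid>
    print(uuid)
```

Chaining the commands this way is what makes the CLI attractive for the "do things programmatically" use case mentioned above.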
Everyone should be aware of KB 1003212. EVC has not changed over the years, but the accepted terminology has changed over the last decade. One question I keep hearing is: can you do a vMotion between processor families? Over the previous five years, Intel has started calling its processor generations "families," like the Skylake family or the Haswell family.
Families for EVC do not equal Intel families. VMware EVC covers both Intel and AMD microprocessors; for EVC, "family" means either Intel or AMD. You cannot vMotion between the Intel family and the AMD family, but with the correct EVC configuration you can vMotion between Skylake, Haswell, Ivy Bridge, etc., which are referred to as generations of Intel chipsets.
The extra twist is Spectre/Meltdown: if you have patched your 5.5 or 6.0 hosts for Spectre/Meltdown and you are trying to vMotion to 6.5, be aware that your destination hosts need to be patched to a level that supports Spectre/Meltdown as well.
Another question that has been bubbling up is support for Skylake: Skylake EVC mode is only supported on 6.7 and above.
Now let's talk about how EVC works. It is a masking technology that masks the features of newly released processors. Here is a list of Intel processor levels today:
L0 Intel "Merom"
L1 Intel “Penryn”
L2 Intel “Nehalem”
L3 Intel “Westmere”
L4 Intel “Sandy Bridge”
L5 Intel “Ivy Bridge”
L6 Intel "Haswell"
L7 Intel “Broadwell”
L8 Intel “Skylake”
Ok, so here are a few examples:
L0 L1 L2 L3 L4 L5 L6 L7 L8
With EVC enabled at the L5 level ("Ivy Bridge"), you can vMotion a VM from hosts or clusters that have EVC mode enabled at levels L0-L5. The reason is that the CPU instruction sets for L0-L5 are visible to ESXi.
With EVC enabled at the L7 level ("Broadwell"), you can vMotion a VM from hosts or clusters that have EVC mode enabled at levels L0-L7.
When EVC is not enabled, each host sets compatibility at the level of its own processor. Recall that L8 "Skylake" is only supported on 6.7, so if you have an ESXi host with Skylake on 6.5, it will set the host CPU instruction set to L7 "Broadwell."
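The examples above boil down to a simple rule, sketched below. This is an illustration of the masking idea using the L0-L8 levels listed earlier, not VMware's actual implementation:

```python
# EVC levels from the list above, index = level number (L0..L8).
EVC_LEVELS = ["Merom", "Penryn", "Nehalem", "Westmere", "Sandy Bridge",
              "Ivy Bridge", "Haswell", "Broadwell", "Skylake"]

def can_vmotion(source_level: int, cluster_evc_level: int) -> bool:
    """A VM can move into a cluster whose EVC baseline is at or above
    the source's level: EVC masks every feature above the baseline, so
    only the L0..baseline instruction sets are visible to the VM."""
    return source_level <= cluster_evc_level

# EVC enabled at L5 ("Ivy Bridge"): sources at L0-L5 can vMotion in.
print(can_vmotion(4, 5))  # True: a "Sandy Bridge" source fits an L5 baseline
print(can_vmotion(7, 5))  # False: "Broadwell" features exceed the L5 mask
```

The same rule explains the L7 example: any source at L0-L7 fits a "Broadwell" baseline.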
The configuration maximum for vSAN 6.x is 64 nodes per cluster, but by default you can only have 32 nodes in a cluster.
To enable support for more than 32 nodes per cluster, you need to follow the steps outlined in KB 2110081.
I am going to walk you through what happens if you go over the 32-node default maximum and how to verify the settings once you have made the changes outlined in KB 2110081.
When you add the 33rd node to a default vSAN cluster, that node will show up as a second vSAN partition. The 34th node will show up as a third partition, and the partition count will continue to increase with each host you add after 32.
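The partition behavior described above can be sketched in a few lines; this is just an illustration of the pattern (each node beyond the default limit lands in its own new partition), not anything vSAN exposes:

```python
DEFAULT_NODE_LIMIT = 32  # default vSAN cluster membership limit

def expected_partitions(node_count: int, limit: int = DEFAULT_NODE_LIMIT) -> int:
    """Nodes up to the limit form one partition; each extra node forms
    its own partition (33rd node -> partition 2, 34th -> partition 3, ...)."""
    extra = max(0, node_count - limit)
    return 1 + extra

print(expected_partitions(32))  # 1: everything in one healthy cluster
print(expected_partitions(33))  # 2: the 33rd node is isolated in partition 2
print(expected_partitions(34))  # 3
```

Seeing the partition count climb like this is the symptom that tells you the KB 2110081 changes have not been applied yet.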
You can verify the host is not in the cluster with the command "esxcli vsan cluster"; the expected result for the partitioned host will look like this:
Cluster Information
Enabled: true
Current Local Time: 2018-07-17T5:23:35
Local Node UUID: 57e1234c-83a9-add9-ec1f-0cc47ab35674
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 57e1234c-83a9-add9-ec1f-0cc47ab35674
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 57e1234c-83a9-add9-ec1f-0cc47ab35674
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 57e1234c-83a9-add9-ec1f-0cc47ab35674
Sub-Cluster Membership UUID: 57e1234c-83a9-add9-ec1f-0cc47ab35674
You can also check whether the host appears in the unicast list with the command "esxcli vsan cluster unicastagent list".
Ok, we have verified that the host is communicating and is in a cluster by itself; time to make the changes noted in KB 2110081. Please note that vSAN 6.6.1 was used in this environment, so we could skip step 4 of the KB.
After the host is rebooted, let's verify that the commands made the requested changes:
esxcli system settings advanced list -o /VSAN/goto11
esxcli system settings advanced list -o /Net/TcpipHeapMax
At this point you will get an "Advanced vSAN configuration in sync" warning; this is expected until you have made the change on every host.
Once you complete these steps on every host, the Advanced vSAN configuration warning will go away, and you will be able to add more than 32 hosts to your vSAN cluster.
I have had this conversation with many customers over the years: vROps metrics are only as good as the metrics collected and the logic applied by vROps.
Until recently (prior to VMware Tools 10.3), vROps 6.7 relied heavily on the following two metrics for memory: consumed and active.
The nature of the memory consumed metric is such that if a VM requests memory from the hypervisor, the value of this metric increases. However, it is difficult for the hypervisor to know when to free host physical memory upon virtual machine memory deallocation, because the guest operating system's free list is generally not accessible to it. So in practice, the memory consumed metric very often reaches close to 100% and does not decrease even when the VM later becomes idle.
This happens because some operating systems tend to assign the entire amount of memory they see in the "machine" to various kinds of caches to get the best possible performance. Some databases and similar workloads grab as much memory as they can so they can use large buffers and do most of their work in memory.
"Active" memory, on the other hand, is an estimate by the VMkernel based on recently touched memory pages. The purpose of this metric is to ensure that, in case of memory overcommitment, the hypervisor gives host memory to active guest memory as much as possible.
Memory overcommitment allows the hypervisor to use memory reclamation techniques to take inactive or unused host physical memory away from idle virtual machines and give it to virtual machines that will actively use it. Without a counter like "active," it would not be possible to support memory overcommitment, since the hypervisor cannot reclaim host physical memory upon virtual machine memory deallocation.
However, the value of this metric differs from what the guest operating system reports, because it is an estimate based on whether pages were touched within the sampling interval. If the guest OS has acquired memory but is not actively touching it, you may see a much lower value than what you see inside the guest. So only by looking inside the guest is it possible to understand how much of its memory is currently used.
Therefore, in the 6.7 release, the implementation of the Memory|Usage% metric was changed to rely on a new metric, Memory|Utilization (KB), which considers usage from the guest OS perspective rather than the hypervisor perspective.
This new Memory|Utilization (KB) metric was implemented for the new capacity engine and relies on the guest OS memNeeded metric; however, if guest OS data was not available, it fell back to the hypervisor's memory consumed metric.
This caused a behavioral change for Memory|Usage% when guest data was not available: it started to show the value of the memory consumed metric, while in previous versions Memory|Usage% used the hypervisor's active memory metric.
On the vROps side, we are going to change the implementation of the memory usage metric:
1. Memory|Usage(%) – calculated based on the guest metrics (Guest|Needed Memory(KB)), or falling back to the vCenter-provided mem.usage.average (which is based on active memory) if guest metrics are not available.
a. Use guest metrics if available; otherwise, restore the behavior from before 6.7.
However, the existing formula of Memory|Utilization (KB) is kept as is:
2. Memory|Utilization(KB) – leave this metric as is, i.e., calculated based on the guest metrics, with a fallback to Memory|Consumed(KB) if guest metrics are not available. Capacity planning, WLP, cost calculation, and What-If are all based on this metric.
Although using Memory|Consumed for those engines is quite conservative, at least it is safe, unlike using active memory or demand (which is based on active memory as well), which can potentially lead to wrong recommendations.
a. So when guest metrics are not available, virtual machine capacity-related metrics for memory will be calculated based on the consumed metric, and the behavior will be very conservative. This method is much safer from a capacity planning, WLP, and What-If perspective when deciding whether a VM can fit in a target cluster. Rightsizing will not be as aggressive as before and may not recommend downsizing the VM.
b. Since the Memory|Workload % metric is calculated based on memory utilization, it will show much higher values than Memory|Usage%. So if some management packs have defined alerts on the Memory|Workload% metric, those alerts may remain triggered.
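The two fallback paths described above can be sketched as follows. The function names and shapes are illustrative only; the real vROps formulas are internal to the product:

```python
# Sketch of the fallback logic for the two memory metrics described
# above. "guest_needed_kb" stands in for Guest|Needed Memory(KB) and
# is None when guest (VMware Tools) data is unavailable.

def memory_usage_pct(guest_needed_kb, active_based_usage_pct, total_kb):
    """Memory|Usage(%): prefer the guest metric; otherwise fall back to
    the vCenter mem.usage.average, which is based on active memory."""
    if guest_needed_kb is not None:
        return 100.0 * guest_needed_kb / total_kb
    return active_based_usage_pct

def memory_utilization_kb(guest_needed_kb, consumed_kb):
    """Memory|Utilization(KB): prefer the guest metric; otherwise fall
    back to the more conservative Memory|Consumed(KB)."""
    return guest_needed_kb if guest_needed_kb is not None else consumed_kb

# With guest data: both metrics reflect the guest's view.
print(memory_usage_pct(2097152, 12.5, 4194304))   # 50.0 (% of a 4 GB VM)
# Without guest data: Usage falls back to active, Utilization to consumed.
print(memory_usage_pct(None, 12.5, 4194304))      # 12.5
print(memory_utilization_kb(None, 3800000))       # 3800000
```

Note how the two metrics diverge when guest data is missing: Usage% falls back to the low active-based number, while Utilization falls back to the high consumed number, which is exactly why Memory|Workload% can read much higher than Memory|Usage%.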
I have to thank Hovhannes Antonyan and Sunny Dua for providing much of the context for this blog article.