Tracking Performance and Usability Enhancements to VMware vSAN Since 6.6.1

I’ve been at VMware for 12 weeks now and am continuing to work towards becoming a vSAN expert. One of the many challenges in reaching that goal is learning not only the current state of vSAN’s features and capabilities (the latest release being 6.7U3) but also how vSAN operated in previous versions, so I can articulate to my customers why feature X in this release is relevant to them.

VMware has released updates to vSAN 75 times since the initial release in 2014, including 12 updates in 2019 alone. So where is the best place to start to build a foundational understanding of modern vSAN functionality?

VMware called version 6.6 their “Biggest Release Ever” back in 2017 and admittedly, while I was at Pure Storage, that’s the version where I started to recognize that vSAN had matured a lot. That would have made it a natural baseline for level-setting my knowledge of what most customers’ experience with vSAN will be. However, of the handful of customers that I support in my Global Accounts role at VMware, most are running at least vSphere 6.5U3, so vSAN 6.6.1 will be the basis for my learning.

One source of confusion I’m adjusting to as I dive into vSAN is that vSphere and vSAN version numbers don’t match. One would reasonably expect a product built into another one to have matching versions, but they rarely do. Interestingly, they have matched in the past! One of the most helpful documents I’ve used at VMware while ramping up is KB 2150753, Build numbers and versions of VMware vSAN. I’ve referenced this KB article many times to correlate vSphere and vSAN versions. At the end of the day, matching version numbers is a nice-to-have “feature”, but the mismatch is the reality of two separate business units working on their own products, each with specific goals and milestones for reaching major and minor releases.

I’m going to highlight major performance and usability enhancements to vSAN in the past 4 releases:

What Was New in vSAN 6.6.1

A typical minor dot-release for vSAN: a few new enhancements but nothing major. Although there were 12 updates to 6.6.1 since its initial release (Express Patches, Patches, and Updates), I couldn’t find any release notes. Fundamentally, these were the most important features in this release:

  • VUM Integration: VUM integration automates the process of ensuring that hardware installed in the cluster is on the VMware Compatibility Guide (or HCL). It also provides firmware updates for select hardware vendors such as Dell, Lenovo, Supermicro, and Fujitsu. A known issue in this release is that custom ISOs are not supported in vSAN build recommendations, and hosts built on custom ISOs will display as Non-Compliant.
  • Storage Device Serviceability (Blink Disk Lights): When a device fails, it’s extremely important to be able to find it in the server! This feature gives you the ability to select the disk in the UI and make the LED light blink. Great feature but in this release, it’s limited to HPE DL/ML series servers with Gen 9 controllers.

What Was New in vSAN 6.7 GA

A big usability enhancement in this release was the HTML 5 Client becoming the standard interface for vSphere! Other notable performance enhancements included:

Adaptive Resync

[Figure: Adaptive Resync. Source: StorageHub]

This feature includes three main components: congestion control mechanisms, a dispatch/fairness scheduler, and a bandwidth regulator. In essence, under contention vSAN can throttle I/O caused by resync operations in order to prioritize VM I/O. Before this feature was added, VM I/O and resync I/O were in an every-man-for-himself battle that could cause performance problems. The adaptive nature of this feature means it’s always on, which makes it an invisible vSAN operation that doesn’t need any user-defined settings. The Adaptive Resync Deep Dive on StorageHub goes into much greater detail.
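
To make the idea concrete, here is a toy model of a bandwidth regulator in Python. It is not vSAN’s actual algorithm: the function name `share_bandwidth`, the `resync_floor` parameter, and the numbers are all assumptions chosen for illustration. It only shows the behavior described above, where resync traffic gets what it asks for when there is no contention and is throttled in favor of VM I/O when there is.

```python
def share_bandwidth(capacity, vm_demand, resync_demand, resync_floor=0.2):
    """Toy bandwidth regulator (illustrative only, not vSAN's real logic).

    capacity, vm_demand and resync_demand use the same unit (e.g. MB/s).
    resync_floor is a hypothetical minimum share of capacity kept for
    resync under contention so that resyncs still make progress.
    Returns (vm_bandwidth, resync_bandwidth).
    """
    if vm_demand + resync_demand <= capacity:
        # No contention: both traffic classes get everything they ask for.
        return vm_demand, resync_demand

    # Contention: resync is only guaranteed its floor (or less if it asks for less).
    resync_guaranteed = min(resync_demand, capacity * resync_floor)
    vm_bw = min(vm_demand, capacity - resync_guaranteed)
    # Whatever VM I/O leaves unused is handed back to resync.
    resync_bw = min(resync_demand, capacity - vm_bw)
    return vm_bw, resync_bw


# Quiet cluster: nothing is throttled.
print(share_bandwidth(1000, 300, 400))
# Busy cluster: resync is squeezed down to its floor, VM I/O gets the rest.
print(share_bandwidth(1000, 900, 500))
```

The floor is there so that a busy cluster cannot starve resyncs indefinitely, since a resync that never finishes leaves objects out of compliance with their storage policy.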

New Health Checks in vSAN Health

vSAN Health is a cloud-connected, built-in framework for providing proactive health checks for vSAN clusters. Participation in VMware’s Customer Experience Improvement Program (CEIP) is mandatory to realize this benefit. This capability was initially released in vSAN 6.6, and the additional checks added in 6.7 include:

  • Host maintenance mode verification
  • Host consistency check for advanced settings
  • Improved vSAN and vMotion network connectivity checks
  • Improved vSAN Health Service installation check
  • Physical Disk Health checks combine multiple checks into a single health check
  • Improved HCL check
  • Firmware checks are now independent of driver checks

Pete Koehler, in VMware’s HCIBU technical marketing group, wrote a detailed post on the health checks as well as a video: #StorageMinute: vSAN Health Checks – Guidance and Remediation Made Easy.

Stretched Cluster Enhancements

This release had 3 new features to improve performance and reliability when using stretched clusters. Namely:

  • Intelligent site continuity: If there’s a partition in the cluster (link goes down, etc), vSAN will first validate which site provides maximum data availability before establishing a quorum with the witness. For example, if Site A (preferred) lost a node or a device during the partition and objects are in a degraded state but Site B (secondary) is healthy, vSAN will consider Site B active until Site A is healthy again.
  • Witness traffic separation: A separate vmkernel NIC can be dedicated to vSAN witness traffic when using stretched clusters. Previously, the vSAN data network had to be able to communicate with the vSAN witness host, which meant that VLAN had to be stretched across the WAN as well. When deploying stretched clusters, separating witness traffic is recommended.
  • Efficient inter-site resync: A proxy host is established for components that need to be resynced across sites following a failure, so that every copy needed to meet the storage policy requirements does not have to be sent across the WAN.

[Figure: Inter-site resync enhancement in vSAN 6.7 GA]
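
A rough back-of-the-envelope calculation shows why the proxy approach in the efficient inter-site resync bullet helps. The sketch below is in Python. The function name `wan_resync_gb` and the assumption that the recovering site needs `copies_at_site` full replicas of the data are mine for illustration, not something taken from the release notes.

```python
def wan_resync_gb(data_gb, copies_at_site, use_proxy):
    """Estimate the GB sent across the inter-site link for one resync.

    Assumption: the recovering site needs `copies_at_site` full replicas.
    Without a proxy, every replica is copied over the WAN. With a proxy,
    one replica crosses the WAN and the rest are copied locally from it.
    """
    if copies_at_site < 1:
        raise ValueError("copies_at_site must be at least 1")
    return data_gb * (1 if use_proxy else copies_at_site)


before = wan_resync_gb(500, copies_at_site=2, use_proxy=False)
after = wan_resync_gb(500, copies_at_site=2, use_proxy=True)
print(f"WAN traffic without proxy: {before} GB, with proxy: {after} GB")
```

With these made-up numbers the inter-site link carries half as much data, and the saving grows with the number of replicas kept at each site.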

More details on vSAN 6.7 GA updates can be found in the release notes.

What Was New in vSAN 6.7 Update 1

vSAN 6.7U1 seems to have been the biggest update to vSAN since 6.6, and there are a lot of great performance and usability enhancements in this release!

Cluster Quickstart

Cluster Quickstart Wizard enables new vSAN and non-vSAN clusters to be created quickly via the vSphere client or APIs.

[Figure: Cluster Quickstart Wizard in vSphere 6.7 Update 1]

The following tasks are performed to speed up and ease the deployment process of vSphere clusters:

  • Set up HA, DRS, and vSAN
  • Add hosts
  • Select vSAN deployment type
  • Network configuration including vSphere Distributed Switching
  • Disk Group configuration
  • Enable Deduplication & Compression / Encryption

VUM Updates

Remember how in 6.6.1 there was VUM integration? Well kinda…what was missing was the ability to utilize VUM to update vSAN clusters when using OEM-specific ISOs. That’s fixed in this release, but there’s still no ability to update vSAN through VUM with custom ISOs.

Maintenance Mode & Host Decommissioning Enhancements

When placing a host into maintenance mode, whether to perform updates or to decommission it, vSAN will now perform a full simulation of the activity (assessing the capacity/availability impact of the host going into maintenance mode and the ability of the cluster to redistribute object components) and report back success or failure.

Additionally, the “object repair delay timer” setting (around since vSAN 5.5) is now in the GUI. This allows an administrator to modify how long vSAN waits before rebuilding data when components are out of compliance with the storage policy due to a disk or node failure.

TRIM/UNMAP Support

vSAN now has awareness of TRIM/UNMAP commands sent from the Guest OS and can reclaim previously allocated blocks as free space.

Mixed MTU Support for 2 Node and Stretched Clusters

Remember that Witness Traffic Separation (WTS) feature in 6.7 GA? It was nice that a different vmkernel port could be used to separate vSAN data traffic from witness traffic; however, it was still required that the MTU matched on all vmkernel interfaces. That changed in 6.7U1 and now it’s possible to have Jumbo Frames on the vSAN data vmkernel interfaces while using a standard MTU setting on the vmkernel interface for witness traffic!

Enhanced Health Checks & Support

  • Network performance health check ensures that sufficient performance can be achieved
  • Storage controller firmware check can now display multiple VCG-approved firmware versions and classify them, for example as latest, not latest, or not on the HCL
  • Expanded diagnostics in vSAN Support Insight, which gives GSS tools to capture network diagnostic data and further reduces the need for collecting and transmitting logs

[Figure: Enhanced diagnostics for GSS in vSAN Insight]

More details on vSAN 6.7 Update 1 features can be found in the release notes.

What’s New in vSAN 6.7 Update 3

Finally! We’ve made it to the current version of vSAN and you may have noticed that we skipped over Update 2. That’s because vSphere 6.7 Update 2 didn’t include any new features or enhancements to vSAN so it was skipped. I guess VMware tries to keep versions aligned after all?

Update 3 is another huge leap forward for vSAN, with the biggest change being the introduction of Cloud Native Storage. This isn’t specifically tied to just vSAN. Instead, it enables vSphere to provide persistent storage to Kubernetes and gives the vSphere administrator the ability to select the required storage (vSAN, VMFS, NFS) for the pod. There’s an excellent doc on Getting Started with VMware Cloud Native Storage here which walks you through setting up a k8s cluster, deploying applications, and managing container volumes.

  • VUM integration gets another update: instead of showing only the latest version of vSAN, you can create new baselines that allow you to stay at the current version and only show new patches and updates
  • New Monitoring and Dashboards
    • Capacity Monitoring Dashboard has been redesigned to provide better visibility into overall as well as granular utilization, with new insights per site, per fault domain, and at the host/disk level
    • Resync: improved accuracy when displaying time remaining to complete a resync
    • Data migration pre-check: new dashboard that provides detailed information when performing data migration activities for maintenance mode tasks. Provides insight into object compliance, cluster capacity, and even predicts the health of the cluster before placing a host into maintenance mode

Parallel Resynchronization

In the past, when vSAN was resyncing components, it would use a single thread to copy the data. This isn’t really a problem if the components are small, as they’re likely to transfer quickly; however, what if we have many max-size components (255GB) due to large VMDKs? For example, a 5TB VMDK will span over 20x 255GB components. vSAN 6.7U3 now uses multiple parallel streams per component so that resyncs complete faster. Bandwidth for this process is managed by Adaptive Resync, which was introduced in 6.7 GA.
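
The component arithmetic is easy to check. The short Python sketch below counts only the data components for a single copy of the VMDK, using the 255GB maximum cited above. It ignores extra replicas, witness components, and striping, and the function name `component_count` is just an illustrative choice.

```python
import math

MAX_COMPONENT_GB = 255  # maximum component size cited above


def component_count(vmdk_gb, max_component_gb=MAX_COMPONENT_GB):
    """Minimum number of components needed to hold one copy of a VMDK."""
    return math.ceil(vmdk_gb / max_component_gb)


# A 5TB VMDK, counting 1TB as 1024GB.
print(component_count(5 * 1024))  # prints 21
```

Each of those components previously resynced over a single stream, which is why parallel streams matter most for large VMDKs.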

Introducing Automatic Rebalance

In previous versions of vSAN, administrators could manually initiate a proactive rebalance after being alerted by a vSAN health check that disk(s) were imbalanced. Now automatic rebalancing can be configured to enable vSAN to handle these operations without user intervention. Information on how to enable automatic rebalancing can be found here. Be sure to adjust the vSAN health check to prevent unnecessary alerts!
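
As a simplified illustration of what “imbalanced” means here, the Python sketch below flags a cluster when the gap between its fullest and emptiest disk exceeds a threshold. The function name `needs_rebalance` and the default of 30 percentage points are assumptions for this example, and vSAN’s real rebalancing logic is more involved than a single comparison.

```python
def needs_rebalance(disk_used_pct, threshold_pct=30):
    """Return True when the gap between the fullest and emptiest disk
    exceeds threshold_pct percentage points (illustrative check only)."""
    if not disk_used_pct:
        # No disks reported, so there is nothing to rebalance.
        return False
    return max(disk_used_pct) - min(disk_used_pct) > threshold_pct


print(needs_rebalance([20, 45, 70]))  # gap of 50 points
print(needs_rebalance([40, 50, 60]))  # gap of 20 points
```

A manual workflow would alert on the first case and wait for an administrator, while automatic rebalancing would start moving components without intervention.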

New Tool: vsantop

vSphere administrators have been using esxtop for years and now there’s a similar tool, vsantop, to measure CPU usage for storage-related tasks to help with troubleshooting and support cases. This can be especially useful for providing quantifiable measurements that help administrators understand the impact of using data services like dedupe & compression or data-at-rest encryption.

There are still significant enhancements since vSAN 6.6.1 that weren’t mentioned here, including improvements to I/O handling, resync and rebalancing performance, and degraded device handling. VMware has made significant investments in vSAN since its release in 2014, and it serves as a solid foundation for on-premises and hybrid cloud storage.

This exercise was very productive to help me understand the progress that vSAN has seen over the last 2 years and has better prepared me to discuss upgrade paths and new features with customers.
