
Veeam advises of new vSphere CBT issue - References vSphere 6.x, Nimble Storage, Thin Disks & VVols.

 
martinco-cae
Advisor

For those who do not subscribe to the Veeam Community Forums Digest, yesterday Gostev advised of a new vSphere CBT issue which is causing "corruption issues after full VM restores". The message references "vSphere 6.x, Nimble Storage, Thin Disks, VVols".

Nimble may not be the only affected vendor, and Veeam haven't said which Nimble array model the issue was reported on.

https://forums.veeam.com/vmware-vsphere-f24/relevant-system-characteristics-affected-by-vsphere-cbt-bug-t49960.html

THE WORD FROM GOSTEV
What can be worse than a new vSphere changed block tracking (CBT) bug? A CBT bug that even an Active Full backup does not help against! And unfortunately, we have just confirmed that such a bug exists. Now, I recognize this may look like an April Fool's joke, so to be clear – this is NOT one. We've already demonstrated this bug to VMware support using the naked API, so everyone is at risk no matter what vSphere backup product they are using. However, there's hope that the bug is isolated to the particular storage model and/or Virtual Volumes (VVols) only – otherwise, we'd probably have way more customers reporting failed recoveries.

It all started with a support case from a fresh vSphere 6.0 deployment running VMs with thin disks hosted on VVols backed by Nimble storage. The customer was experiencing classic data corruption issues after full VM restores – the restored VMs had Windows firing up chkdsk, and DBCC CHECKDB was reporting corruption in Microsoft SQL Server databases. This normally points at storage corruption – but the production VMs did not have these issues, and the backup files' content matched the checksums. That made CBT the next suspect – but further troubleshooting revealed that the corruption could occur even when restoring from an active full backup! On the other hand, the issue would not reproduce when CBT was disabled completely. Magic, eh?

But our genius support folks did come up with a way to nail this problem down. They first changed permissions on the vCenter Server account used by Veeam Backup & Replication so that it could not delete the working snapshots created by a backup job. Then, after reproducing the issue again, they cloned the VM from the corresponding working snapshot, mounted the VMDK of the clone and the VMDK from the full backup file on a Linux box, and did a binary compare – which, not surprisingly, showed a mismatch in some disk areas. And finally, by referring to the debug log of the corresponding job run, they found that the differences were in the disk areas that were NOT returned by the QueryChangedDiskAreas() function call with the changeId "*" parameter.
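
For illustration, the binary-compare step might look something like the following minimal Python sketch. This is not Veeam support's actual tooling, and the file paths are placeholders: it simply reads two raw disk images chunk by chunk and prints the offsets that differ, which could then be cross-checked against the disk areas listed in the job debug log.

#!/usr/bin/env python3
# Minimal sketch: compare two raw disk images (e.g. the flat VMDK of the
# cloned VM vs. the disk image extracted from the full backup) chunk by
# chunk and report where they differ. Paths are placeholder examples.
import sys

CHUNK = 1024 * 1024  # compare in 1 MiB chunks

def compare_images(path_a, path_b):
    mismatches = []
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(CHUNK)
            block_b = b.read(CHUNK)
            if not block_a and not block_b:
                break
            if block_a != block_b:
                mismatches.append(offset)
            offset += CHUNK
    return mismatches

if __name__ == "__main__":
    # usage: compare.py /mnt/clone/vm-flat.vmdk /mnt/restore/vm-flat.vmdk
    for start in compare_images(sys.argv[1], sys.argv[2]):
        print(f"mismatch in chunk starting at byte offset {start}")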

Now, let me step back and explain what this vSphere API function does. It is the cornerstone CBT function used to query used and changed VMDK blocks. During an incremental run, a backup job passes this function the changeId of the snapshot created by the previous run, and thus gets all blocks changed since the last backup – a very simple concept. During an initial run, a.k.a. full backup, when there is no previous backup run to reference yet, the special "*" changeId is passed instead, which makes the function return allocated VMDK blocks only. This dramatically accelerates full backups, since there is no need to read through TBs of unallocated (and thus obviously empty) VMDK blocks. But even if a backup vendor chooses not to use this functionality for full backups, this query will still be issued by the ESXi host itself when CBT is first initialized on a VM – meaning, there's no way to avoid it.
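
To make the two calling modes concrete, here is a minimal pyVmomi sketch of the full-backup case, i.e. querying allocated blocks with the special "*" changeId. It assumes CBT is already enabled on the VM and that a working snapshot exists; the vCenter hostname, credentials, VM name and device key 2000 (typically the first virtual disk) are placeholder values, not details from the case above.

# Minimal pyVmomi sketch (assumed setup, placeholder names): list the VMDK
# areas that QueryChangedDiskAreas() reports as allocated when the special
# "*" changeId is passed. With a real changeId from a previous snapshot,
# the same call would instead return blocks changed since that point.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "test-vm")

    snap = vm.snapshot.currentSnapshot  # working snapshot taken for the backup
    disk = next(d for d in vm.config.hardware.device
                if isinstance(d, vim.vm.device.VirtualDisk) and d.key == 2000)

    offset = 0
    while offset < disk.capacityInBytes:
        info = vm.QueryChangedDiskAreas(snapshot=snap, deviceKey=disk.key,
                                        startOffset=offset, changeId="*")
        for extent in (info.changedArea or []):
            print(f"allocated: start={extent.start} length={extent.length}")
        if info.length == 0:                     # nothing more reported; stop
            break
        offset = info.startOffset + info.length  # results may span several calls
finally:
    Disconnect(si)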

Bear with me, we're almost there now! There's one key difference in the QueryChangedDiskAreas() logic between the two scenarios I explained above. Using a changeId belonging to a previous VM snapshot returns all changed blocks that were tracked by the ESXi host itself in the CTK file. This functionality has had its own share of bugs over the past years, but all of them were fixed, and by now we can be fairly confident that modern ESXi versions track changes reliably, respecting disk resizes, vMotions and so on (you know, all those bugs we've been through). However, when the special "*" changeId is used, this function returns allocated blocks based on data provided by the storage itself. And at least in this particular case, it appears that the storage provides invalid allocation data.
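
As a rough illustration of what invalid allocation data means in practice, the sketch below (an assumed verification approach, not the tool VMware is working on) scans a raw copy of the disk and flags any non-zero chunks that fall outside the extents reported as allocated by the "*" query – exactly the kind of area a full backup would skip and a restore would then bring back corrupted.

# Rough sketch of a sanity check on the "*" allocation data: any non-zero
# chunk of the raw disk that is not covered by a reported extent indicates
# the mismatch described above. Chunk-granularity only; the extents list
# and image path are made-up example values.

def unreported_data(image_path, allocated_extents, chunk=1024 * 1024):
    """Yield offsets of non-zero chunks whose start lies outside all extents."""
    with open(image_path, "rb") as img:
        offset = 0
        while True:
            block = img.read(chunk)
            if not block:
                break
            covered = any(start <= offset < start + length
                          for start, length in allocated_extents)
            if not covered and block.strip(b"\x00"):
                yield offset
            offset += chunk

# Example usage with made-up extents (start, length) in bytes:
# extents = [(0, 65536), (1048576, 4194304)]
# for off in unreported_data("/mnt/clone/vm-flat.vmdk", extents):
#     print(f"data found outside reported allocation at offset {off}")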

According to the latest update, VMware is now working on a tool that should help confirm that this bug indeed lies with the particular storage. I will keep you posted as we learn more – meanwhile, as always, remember to test your backups! And a big shout out to the many VMware teams involved – SDK support, the VADP team and the VVol team, to name a few! We had absolutely incredible collaboration working with them on this issue, receiving very prompt responses and seeing great involvement in what in the end appears to most likely be a 3rd-party vendor bug. A very refreshing experience, for sure!

Hopefully more specific information on affected configurations will be released in the coming days/weeks.

M