SAN/IQ upgrade performance degradation caused VMFS iSCSI timeouts


We recently (finally) upgraded from SAN/IQ 8.5 to 9.0. We've been a LeftHand customer for years, since well before the acquisition, so we're really familiar with the upgrade process and have done it a number of times since 7.0.

 

It appears that the "parallelization" of the new 9.0 upgrade tool may have grabbed more performance from our P4500 and P4500G2 nodes than they could actually spare, causing our ESXi servers to report, via syslog, to the mothership:

 

vmc004 iscsid: Kernel reported iSCSI connection 5:0 (iqn.2003-10.com.lefthandnetworks:san-nj1-mg:68:vmfs001 if=default addr=10.16.56.10:3260 (TPGT:1 ISID:0x1) (T4 C0)) error (1011) state (2)

 

Now, repeat that for about 8 VMFS volumes and a dozen VMC hosts, and you've got every single one of our VMs' underlying Linux filesystems dropping into Read-Only mode and requiring a reboot in order to come back to sanity. 
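For anyone wanting to size up the same kind of event on their own cluster, here is a rough Python sketch that tallies these iscsid errors per host and volume from collected syslog lines. The regex is modeled only on the single line quoted above, so other builds may log a different format. The function name `tally_iscsi_errors` is made up for illustration, and treating the last colon-separated piece of the IQN as the volume name is an assumption based on that one sample.

```python
import re
from collections import Counter

# Pattern modeled on the one iscsid line quoted in this post; it is an
# assumption that other hosts and builds log the same layout.
ISCSI_ERR = re.compile(
    r"(?P<host>\S+) iscsid: Kernel reported iSCSI connection \S+ "
    r"\((?P<target>iqn\.\S+) .*?addr=(?P<addr>[\d.]+):(?P<port>\d+).*"
    r"error \((?P<error>\d+)\) state \((?P<state>\d+)\)"
)


def tally_iscsi_errors(lines):
    """Count iscsid connection errors keyed by (host, volume, error code)."""
    counts = Counter()
    for line in lines:
        match = ISCSI_ERR.search(line)
        if match is None:
            continue  # not an iscsid connection-error line
        # Assumption: the volume name is the last ':'-separated IQN field.
        volume = match.group("target").rsplit(":", 1)[-1]
        counts[(match.group("host"), volume, match.group("error"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "vmc004 iscsid: Kernel reported iSCSI connection 5:0 "
        "(iqn.2003-10.com.lefthandnetworks:san-nj1-mg:68:vmfs001 "
        "if=default addr=10.16.56.10:3260 (TPGT:1 ISID:0x1) (T4 C0)) "
        "error (1011) state (2)",
    ]
    for key, count in sorted(tally_iscsi_errors(sample).items()):
        print(key, count)
```

Feeding it the lines from the central syslog host for the upgrade window should show how many hosts and volumes were hit, which can then be lined up against the times each node was rebooting or resyncing.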

 

We've never experienced performance problems this bad during any previous upgrade, but those upgrades all went one node at a time, in a slow, methodical manner. My conjecture is that the new parallel process was eating up cycles on all the nodes at once (even if only one node at a time was physically "out of the cluster" for reboot/resync/whatever), and with a cluster our size, it could easily have been bouncing/resyncing two nodes at the same time, putting a further crimp in performance.

 

(a) Has anyone else seen this happen?

(b) Has anyone come up with a way to prevent it from happening in the future?