
vSphere takes 10 minutes to reconnect to volume?

Paul Hutchings
Super Advisor

vSphere takes 10 minutes to reconnect to volume?

Doing some testing of a vSphere server with dedicated iSCSI NICs connected to a dedicated iSCSI switch, to which some P4000 nodes are also connected.

The P4000 management group has 2 clusters in it, with each cluster currently having a single volume accessible to the vSphere box.

The vSphere box is set up to use vSphere MPIO with round-robin.

So, I pull the power on the iSCSI switch, watch in vSphere, the NICs go down, the iSCSI connection is lost to both volumes.

I put power back to the switch, watch in vSphere, the NICs come back up, the connection to one volume comes back, the other volume stays offline for around 10 minutes at which point it comes back of its own accord.

Is there anything I may have missed here that would account for this?

I did find this which suggests I may need to disable iSCSI load balancing on the Server object in the CMC?

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1016836

Thanks in advance.
24 REPLIES
Paul Hutchings
Super Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

Oh and to confirm/clarify doing a refresh of the storage or the HP from within the vSphere client doesn't make the volume re-appear.
Paul Hutchings
Super Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

Be grateful if anyone has any thoughts on this as I'm waiting on support to remote in.

I'm confused, as my nodes are in the correct sites/switches, as are my servers, so I would assume that pulling a single site would leave the resources in the remaining site online. That is not what I'm consistently seeing (it seems to vary depending on which node the initial gateway connections are made through).
adolbec
Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

Hello Paul,

You do not specify which version of vSphere you are using. From my experience of using vSphere 4.0 U2 connecting to P4000 VSAs, you may be hitting an "All Paths Dead" (APD) condition on your vSphere servers when connectivity to your iSCSI volumes is broken. If this is the case, you will see a bunch of errors in your servers' vmkwarning logs. You would have to patch your servers and apply some configuration changes; search on the string above.

Alain
Paul Hutchings
Super Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

Thanks for the reply.

I've been using 4.1 (vmware edition) but could use 4.0 U2 or the HP ESX ISO (not entirely clear which additional drivers/patches this has applied).

Be appreciative of any suggestions as I don't have an explicit need to use anything other than a 4.x release.
adolbec
Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

I suggest that you first review your ESX servers' logs, especially the vmkwarning one, to see what you have in there, and then apply the appropriate patch and/or workarounds. If APD is the cause of this, you may want to test whether running "esxcfg-advcfg -s 1 /VMFS3/FailVolumeOpenIfAPD" solves your problem.

Alain
Paul Hutchings
Super Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

Just had a bit of a eureka moment.

I had been pinging all the cluster IPs from a laptop connected to the switch infrastructure, rather than pinging from the vsphere hosts at the time of a dropout/failure.

When the LUNs disappear I can ping one cluster IP but not the other.

This obviously points to the vSphere hosts in some way, shape or form.
Paul Hutchings
Super Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

oh and I don't have a /var/log/vmkwarning log..
adolbec
Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

If you are using ESXi instead of ESX, you will find the kernel's messages in /var/log/messages. In any case, you should find the related messages within the /var/log files.

Alain
Paul Hutchings
Super Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

Thanks, seeing this sort of thing (attached).

What is odd is that all of the servers are connected to a dedicated switch infrastructure.

When we have the issue, neither vSphere host will ping *one* of the cluster IPs but will ping the other one.

The vSphere hosts will ping all the individual nodes in both clusters.

A laptop connected to the switches will ping both cluster IPs.
adolbec
Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

Paul,

Reading your description at the beginning of this thread, I assume that you have at least 4 nodes split in 2 clusters? Is it always the same cluster that gives you trouble or is it random?

Alain
Paul Hutchings
Super Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

We have:

2xP4500 SAS
2xP4300 MDL (we actually have 4 but only have 2 up/configured).

The ping problem seems to vary but is always consistent in that one or both cluster IPs won't be pingable from one or both vSphere hosts but both cluster IPs will always be pingable from a laptop on the same dedicated network.

Regardless of whether one or both cluster IPs can (or cannot) be pinged from a particular vSphere host, the ALB IP of each node can always be pinged.

This is what's so confusing - pinging the nodes would seem to confirm networking is good/present from the vSphere hosts.
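The pattern described above (node IPs always reachable, cluster IPs failing only from the vSphere hosts) can be tabulated to make the odd one out obvious. This is only a minimal sketch: the host names, target names and results below are made up for illustration and would need to be filled in by hand from real ping/vmkping runs.

```python
from typing import Dict, List


def unreachable_only_from(
    results: Dict[str, Dict[str, bool]], reference: str
) -> Dict[str, List[str]]:
    """For each source other than `reference`, list the targets that the
    reference machine can ping but that source cannot."""
    reachable_from_ref = {t for t, ok in results[reference].items() if ok}
    report: Dict[str, List[str]] = {}
    for source, pings in results.items():
        if source == reference:
            continue
        failed = sorted(t for t in reachable_from_ref if not pings.get(t, False))
        if failed:
            report[source] = failed
    return report


# Hypothetical results, entered manually after a switch pull/reconnect.
results = {
    "laptop": {"cluster1-vip": True, "cluster2-vip": True, "node-a": True, "node-b": True},
    "esx1": {"cluster1-vip": True, "cluster2-vip": False, "node-a": True, "node-b": True},
    "esx2": {"cluster1-vip": True, "cluster2-vip": True, "node-a": True, "node-b": True},
}

suspects = unreachable_only_from(results, reference="laptop")
print(suspects)
```

If only virtual (cluster) IPs ever show up in the report while the physical node IPs never do, that points away from cabling or VLAN problems and towards something specific to how the host resolves the virtual IP.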
adolbec
Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

From what I gather from your log file, you are using the software iSCSI adapter. If that is the case, make sure you use vmkping rather than the ping utility, so that you are pinging from the proper port group instead of the ESX console. My guess at this time is that you may have a configuration mismatch between your port groups, vSwitch and switch ports.

Alain
Paul Hutchings
Super Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

ping/vmkping seem to produce the same results.

Right now behind me I have one host that can ping both cluster IPs and one host that can only ping one cluster IP.

I've tried moving the NICs to different switches (just in case) and it makes no difference.

Yet if I just walk away and come back in 15 minutes, they'll both be pinging away.

It seems to be "something" that is happening either when the iSCSI initiator is upset, or when the NICs are reconnected (I'm using different NICs on a different card right now).
adolbec
Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

Then I suggest you take it back one step: disable iSCSI altogether and, if possible, use only vmkping to test your connectivity issues. If you have a monitoring port on your switches, you may want to sniff the traffic to see if the ICMP messages are flowing back and forth.
Paul Hutchings
Super Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

Yes I think that is "Plan B" - install ESX from scratch and start with a single VMK then add iSCSI etc.

If I'm doing that, does anyone have any thoughts on 4.1 vs. 4.0 U2 vs. the HP December 4.1 ISO?
Paul Hutchings
Super Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

Right, iSCSI initiator disabled, same result.

From the vSphere hosts I can ping all the nodes' ALB IPs, but cannot ping either cluster IP from one host following a switch pull/reconnect (and again, I'm guessing in around 10 minutes from now the ping will work).

From the CMC I can ping both VMKs on the vsphere hosts from any node.
adolbec
Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

The issue then seems more network related than an iSCSI connectivity problem. From your description of the symptoms, it looks to me like an ARP resolution / cache issue. I do not know right now how to check for that; I suggest that you review at least your port groups and vSwitch configs to ensure that there is no mismatch.

Are you using any port lock down security measures on your switches?
When you are pinging from a workstation, is it located in the same subnet as your P4000 nodes? Is the ARP table the same for the workstation after you restart your switch?
Paul Hutchings
Super Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

Yes that's my thinking now that we appear to have eliminated the iSCSI initiator.

The switches are 2 x 2910al and 1 x 2510g in a redundant loop (so MSTP enabled), with flow-control enabled and a VLAN for iSCSI. That's about all (can post configs if required).

No port lock down, these switches are dedicated for iSCSI, the workstation is just a laptop that I connect to any of the switches.

I will check out the ARP table - I didn't see a way to check that within vSphere though.

(oh and sure enough something timed out and my ping started working again all by itself)
adolbec
Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

If you are using multiple switches, it may also be your switch config that affects the vSphere host. VMware recommends disabling any STP methods on ESX-connected ports. As an additional step, you could test your switch reboot scenario by disabling / disconnecting one of the ESX server interfaces that is a member of the iSCSI segment (assuming that you only use 2 interfaces). You will lose connectivity during the reboot, but it should not take up to 10 minutes to ping your cluster IPs.
Paul Hutchings
Super Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

Looks like it might be the vSphere ARP table timeout: if I compare the ARP tables on the laptop and the vSphere hosts before and after the switch pull, they don't match. vSphere seems to stick on the "before".

There isn't an obvious way to clear it on ESXi so I may install ESX tomorrow and look at that.

I'm not entirely convinced though as enough people are presumably using P4000 with ESXi.

I'll be interested to see what HP come back with when they remote in.
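The before/after comparison between the laptop and a vSphere host can be sketched as a simple diff of IP-to-MAC mappings. This is an illustration only: the addresses are documentation-range placeholders, and the tables are assumed to have been copied by hand from each machine's ARP listing rather than read by any real tool.

```python
from typing import Dict, Tuple


def stale_entries(
    reference: Dict[str, str], suspect: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return IPs whose MAC in `suspect` differs from the one in `reference`.

    Each value is (mac_seen_by_suspect, mac_seen_by_reference).
    """
    stale: Dict[str, Tuple[str, str]] = {}
    for ip, mac in suspect.items():
        ref_mac = reference.get(ip)
        # Compare case-insensitively, since tools print MACs in either case.
        if ref_mac is not None and ref_mac.lower() != mac.lower():
            stale[ip] = (mac, ref_mac)
    return stale


# Hypothetical tables taken after the switch came back.
laptop_arp = {
    "192.0.2.10": "00:00:5e:00:53:02",  # cluster IP, now answered by another node
    "192.0.2.21": "00:00:5e:00:53:01",  # node A
    "192.0.2.22": "00:00:5e:00:53:02",  # node B
}
esx_arp = {
    "192.0.2.10": "00:00:5E:00:53:01",  # cluster IP, still the "before" MAC
    "192.0.2.21": "00:00:5e:00:53:01",
    "192.0.2.22": "00:00:5e:00:53:02",
}

stale = stale_entries(reference=laptop_arp, suspect=esx_arp)
print(stale)
```

An entry that shows up here for a cluster IP, but never for a node IP, would match the symptom of the host "sticking on the before".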
Paul Hutchings
Super Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

"If you are using multiple switches, it may also be your switch config that affects the vSphere host. VMware recommends disabling any STP methods on ESX-connected ports. As an additional step, you could test your switch reboot scenario by disabling / disconnecting one of the ESX server interfaces that is a member of the iSCSI segment (assuming that you only use 2 interfaces). You will lose connectivity during the reboot, but it should not take up to 10 minutes to ping your cluster IPs."

Ah we both replied at the same time I think.

Regarding STP/MSTP, right now it's simply enabled globally on the switch; I've not done anything to restrict it to the inter-switch link ports.

I believe with full ESX you can clear the ARP cache which should make troubleshooting much easier (hell even a cron job to clear it every minute would be better than this!!).
Paul Hutchings
Super Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

Definitely the ARP table.

You can run "esxcli network neighbor list" from the CLI and see the arp table and the lifetime remaining in seconds.

As the lifetime counts down you can't ping the cluster IP, when the lifetime reaches zero ping starts working.

Bugger.
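For anyone following along, the behaviour above can be modelled with a toy neighbour cache. This is a sketch under stated assumptions, not how the ESXi network stack is actually implemented: it assumes the cluster IP ends up behind a different MAC while the host still holds the old entry, that the host never hears an update, and it uses a made-up 600 second lifetime to mirror the roughly 10 minutes seen in this thread.

```python
from typing import Callable, Dict, Tuple


class ArpCache:
    """Toy neighbour cache: an entry is trusted until its lifetime runs out."""

    def __init__(self, lifetime_s: int, resolve: Callable[[str], str]) -> None:
        self.lifetime_s = lifetime_s
        self.resolve = resolve  # asks "the wire" who currently owns an IP
        self.entries: Dict[str, Tuple[str, int]] = {}  # ip -> (mac, expires_at)

    def lookup(self, ip: str, now: int) -> str:
        entry = self.entries.get(ip)
        if entry is None or now >= entry[1]:
            # Missing or expired: resolve again and restart the lifetime.
            mac = self.resolve(ip)
            self.entries[ip] = (mac, now + self.lifetime_s)
            return mac
        return entry[0]  # still within lifetime: cached MAC is used as-is


VIP = "192.0.2.10"  # placeholder cluster IP
owner = {VIP: "00:00:5e:00:53:01"}  # which MAC really answers for the VIP
cache = ArpCache(lifetime_s=600, resolve=lambda ip: owner[ip])


def ping_ok(now: int) -> bool:
    """A ping only works if frames are sent to the MAC that really owns the IP."""
    return cache.lookup(VIP, now) == owner[VIP]


before = ping_ok(0)                # entry learned, valid until t=600
owner[VIP] = "00:00:5e:00:53:02"   # cluster IP moves to another node
during = ping_ok(300)              # cached MAC is stale, ping fails
after = ping_ok(600)               # lifetime reached zero, entry relearned
print(before, during, after)       # True False True
```

The point of the model is only that nothing the host does in the meantime (refresh, rescan, retry) helps until the entry's lifetime expires or the entry is cleared.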
adolbec
Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

Paul,

It does not help you, but here is what I found about ARP control on ESXi: http://serverfault.com/questions/197918/clearing-arp-cache-on-esxi-4-1. My feeling is that you will have to open a support request with VMware. It seems to me that whenever a team has its interfaces reset, it should flush the existing dynamic info and learn again. If you find any workaround for the issue, I would appreciate it if you posted back to the forum.

My 2 cents.

Good luck.

Alain
Paul Hutchings
Super Advisor

Re: vSphere takes 10 minutes to reconnect to volume?

Thanks Alain, the same link I used to see how to view the arp table :)

I'll see what L2 P4000 have to say first - enough people must be doing this that I don't see why I should be the only one to encounter this issue so it may be our configuration somehow (it's entirely HP kit so you'd think they would know it).

If not, I'll try ESX and check that emptying the arp cache manually works.

Appreciate the help/info/feedback :)