Re: StoreVirtual Multipathing Extension Module for vSphere 5.5 missing VMFS datastores

rossiauj · ‎05-22-2015

Hi,

We have a 16-node storevirtual 4530 multis-site cluster which we use to publish VMFS datastores to our ESXi 5.5 hosts across two sites. Each site contains 8 storage nodes and 5 ESXi 5.5 hosts. Both nodes and ESXi hosts are connected through 2 x 10 Gb connections (iSCSI configured on ESXi according to HP best practices and ALB on storage nodes, FlowControl enabled on switchports, hosts and storage nodes).

When we use the default Storage Array Pluggin VMW_SATP_DEFAULT_AA with Round-Robin (VMW_PSP_RR) we have no problems. However when we install the recently (re)published HP StoreVirtual Multipathing Extension Module for VMware vSphere 5.5 (AT004-10518) we experience problems when the ESXi hosts are not cleanly shutdown (iLO poweroff or reset).

After the ESXi hosts are started again, we can see only a small number of the recently available VMFS datastores. I can still see the devices listed (so the luns are available to the ESXi host), only the VMFS datastores are not mounted.

When I run the command 'esxcli storage filesystem list' I get these errors on the ESXi host:

Error getting data for filesystem on '/vmfs/volumes/5551c88a-f3041c44-4ab8-8cdcd4afa378': Cannot open volume: /vmfs/volumes/5551c88a-f3041c44-4ab8-8cdcd4afa378, skipping.

According to some VMware KB article this could be attributed to ATS locking behaviour of the VMFS datastores after a unclean shutdown. I disabled ATS on all ESXi hosts and then rebooted them but the reboot got stuck after the message "vfc loaded successfully". It probably took half an hour or even more to boot.

A normal reboot or shutdown (initiated from the vSphere client) does not cause problems, only a dirty/unclean/unexpected poweroff or reset.

I removed the LH MEM from the ESXi hosts and the problem has gone away. However, the LH MEM gave a very big performance boost (site aware iSCSI connections, true multipathing to all nodes in the site instead of only one gateway connection). I noticed performance and throughput boosts of up to 150 to 500 percent. It would be really bad to miss out on all that performance goodies but I just don't feel very safe with the LH MEM installed.

Anyone else seen this problem or know how to fix it.

The ESXi hosts and storage nodes are not yet running production but will be the next couple of days when we migrate our old environment to this new one. I hope this gets sorted out soon.

Thanks in advance.

Jos Rossiau

fqu · ‎05-28-2015

I've had a big problem with the LH MEM too.

We are using 8 nodes P4500G2 upgraded to StoreVirtual 12 and we are on Vsphere 5.0.

I've tried the new LH MEM module instead of using the default RoundRobin (which works perfectly), and i've started to loose connection with LUNs (we have 15 LUNs in RAID 10 and 1 LUN in RAID 10+2 for essential VMs).

I was loosing the connection with the LUN in RAID 10+2 :the Virtual Center and the CMC were on that LUN !

I've had to connect directly to each ESXi to re-configure them to use the good old RoundRobin.

I think that the new LH MEM module is not ready for production now...

balblas · ‎06-08-2015

We have exactly the same problem. It looks like if the connection to the iSCSI device is interrupted the LH PSP does not always recover correctly, even though all paths are fully functional again. Logs show issues with ATS locking, but I believe the root cause has nothing to do with ATS but with paths to the volume not being recovered by the LH PSP.

Looking at the iSCSI connections to the volume in the CMC I noticed only HP MPIO Control connections existed for the failing host and no HP MPIO Data connections. I suspect without data connections the host can't read or write to the LUN.

Rebooting the host is the only way to recover from this. Sometimes the host seems to be stuck and a hard (ungracefull) reset is needed to recover the host.

GarethRUK · ‎06-17-2015

Hi,

Does anyone have an update on this issue. Has anyone spoken to HP support regarding this.

I've seen some very strange behaviour and failures, but would like to understand the problem.

Regards,

Gareth

JazzyB · ‎06-24-2015

I experience a similar problem when i enabled the driver on 1 ESXi 5.5 host and changed 1 datastore to use the HP_MEM driver.

The datastore immediately disappeared from that host and could not be rescanned or refreshed. I could see the iSCSI paths but the partition was "unknown"

Changing the paths back to VMWare round robin did not reset the SCSI paths back to MPIO Data + Control. They were stuck on the HP ones.

I finally gave up - came here to see if there were issues and removed the driver and rebooted the host. Now everything is back to normal.

I'm guessing this is not ready for non lab use right now. Also are there plans to support ESXi 6?

HPstorageTom · ‎07-01-2015

It seems that the issue is not always there after a unclean shutdown or power outage. My systems came at least up without any problems after a hard power outage due to thunderstorms.

I've asked some engineering folks whether they could reproduce the issue, buth they didn't have any luck either. What they asked me now is whether somebody of you would be able to provide a vm-support log pulled after experiencing the failure?

Generally, I can only recommend to open a support call if you experience such an issue (yes I am assuming here that your systems are still under an HP support contract ...).

rossiauj · ‎08-17-2015

After the initial tests with the multipathing module for vSphere 5.5 we decided to go to production without it.

We used the VMware standard plugin instead.

However, I've found some time to fiddle with it and I have installed it on one of the ESXi 5.5 hosts in our farm.

I did a normal reboot (from the console) and I did a Reset (via iLO) and both times the host rebooted and all the LUNs where available.

Although I did download the VIB again it seems to be the same version as before (<name>hp-lh-mem</name><version>5.5.0-12.0.0.55</version>).

We did however split our 16 node storage cluster into two clusters (1 x 10 node and 1 x 6 node), but that should not be the cause of the problem, now does it? Unless it was a max paths problem.

So unfortunately I will not be able to provide HP information on the problem.

I will keep running some more tests on the host and see if it is stable.

Greetings,

Jos Rossiau

oikjn · ‎08-17-2015

depending on the number of servers and nics, a 16 node cluster can cerfainly become a problem for # connections with MPIO. I forget the max numbers offhand, but I do think it is something around that 10 node cluster that you have to start paying attention to that.

rossiauj · ‎08-18-2015

From the deployment guide a 16-node cluster connected to ESXi hosts with 2 x 10Gbit NICs (32 paths per device) would only just be supported. However, a 16-node cluster would not be best practice, so we decided to split it.

Besides, I do not believe the other people having the same problem as we did were using 16-node clusters, not sure how many NICs though.

I installed the module on another of our ESXi 5.5 hosts and will keep it for a while, during which I will do some tests with them.

rossiauj · ‎09-04-2015

Today installed version 61 of the HP LH MEM driver and after reboot was missing 8 of the 26 volumes on the hosts that I updated.

All paths were up and used. CMC showed 2 control and 10 data connections, as normal. ESXi just did not mount the volumes.

I can leave one of the hosts as is, if HP support needs me to do some log searching/troubleshooting, but not for too long as it is not usuable to us now and I need to revert back to the default VMware multipathing driver as soon as possible.

balblas · ‎09-04-2015

Our issue was resolved by disabling VAAI ATS heartbeat as per http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2113956 as suggested by HP support.

Once VAAI ATS heartbeats were disabled on our VMware hosts, a lost connection to a LeftHand volume no longer caused isuues as long as the host was in mainenance mode. Once the connection was restored a storage rescan cleared all errors on the host again.

david11 · ‎09-15-2015

I'm confused by the last message, I did disable VAAI heartbeat but am experiencing the same issues. The way you wrote it, it sounded like it only resolved your problems in maintenance mode?

I have tried fighting with putting one host on this driver in hopes of gaining better throughput but it seems very flakey thus far.

Could you please be more specific with the order of steps taken to get the datastores to mount correctly after disabling VAAI heartbeat?

Thanks,

David

david11 · ‎09-15-2015

I think I figured it out, I had to re-enter maintenance mode again after disabling the VAAI heartbeat and then exit maintenance mode and it fixed the issue without me having to rescan.

I will follow up with a guide in a few days if this becomes stable, I will first see the performance gains and monitor and then crash the host repeatedly to test recovery for stability.

If all ends well I will post detailed guide with articles I followed and the order that worked for me to get it stable.

Thanks,

David

david11 · ‎09-15-2015

The below is intended to help get people in my situation to improve their performance.

If you are going too try my steps below for how I improved my performance, be sure to only do this on esxi 5.5 hosts that have no VM's running otherwise you will cause downtime / interuptions to your business. As most sites have multiple server hosts you should be able to migrate everything off of a host and test on just an empty host with VM's that are not important 1 by 1 when migrating back to them.

Test, test, and test again to be sure its stable before using these changes in production! /disclaimer over!

Also be sure to read every article link I post in its entirely, they might not apply to your set, they did apply to mine and help me out greatly. YMMV

Ok here is my follow up after the last 5 days of tweaking settings to get better performance on my lefthand Storage. I wanted to share in hopes it helps others with similar setups.

My Equipment:

(2) Dl360 Servers g8. Each server has (2) 1gb links to our LAN for server access, (2) 1gb links to physically seperate storage network.

I run these (2) servers in a vmware esxi 5.5 u3 cluster and it attaches to my Lefthand Storage P4500 G2 nodes.

(4) P4500 G2 nodes in total. Each one has (12) 15k 600gb sas drives giving a total of (48) disks across all (4) nodes.

We are split across two buildings for redundancy. Between these two buildings we have dedicated fiber we own.

So each building broken down into equipment is as follows:

(1) dl360 g8 server.

(2) P4500 g2 lefthand nodes

(2) HP Procurve 2800 series switches (1gb ports) (stacked locally in each building through (4) 1gb ethernet ports) (each switch then has (2) 1gb fibers going across to our other building which attaches to the exact same equipment setup in the other building.

The lefthand P4500 nodes are raid 10 local, and network raid 10 across buildings.

Details of current performance last week as we were migrating more and more servers to this.

Max IO was around 4-5k read and 1-2k writes. Throughput was capping at around 95mb reads and 50mb writes. Terrible I Know! :D. Also note latency was bad while under load. See improvement details below.

The above was with pretty much stock default settings, note the following changes according to HP best practices I found in online documentation:

I Network binded the (2) 1gb ports on each server that went to the storage network. I also had all datastores mounted in vmware round robin. THe p4500 g2 lefthand nodes were setup with Adaptive Load Balancing on their built in (2) 1gb ports. Switches are blanked out with the exception of trunking the ports between the local procruves redundant links. I had tried LACP enabled but it yielded no real difference in results. Most likely because we are not licensed to use dynamic distribution switches in vmware. It comes at an additional cost so we dont have it.

I used SLQIO to benchmark performance and got statistical info from Veeam One monitoring when I ran these benchmarks on virtual server. I use this for my benchmark because most of our load is coming from SQL so it seemed to be the best tool for the job. If you know of better ones, please share! I'm always learning.

First thing I found that was simple and easy fix to my issues was in this article: (Change IO setting to 1 for round robin)

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2069356

After performing this test I saw an amazing performance increase.

Original benchmark:

Datastore Read IO: 4292 - max

Datastore Write IO: 1782 - max

Datastore Read Throughput: 92MB - max

Datastore Write Throughput: 41MB - max

Max Read Latency Spikes: 118ms - avg around 50ms (24 hour period)

Max Write Latency Spikes: 305ms - avg around 100ms (24 hour period)

After IO Change from vmware article above:

Datastore Read IO: 21,623 - max

Datastore Write IO: 11,031 - max

Datastore Read Throughput: 174.35MB - max

Datastore Write Throughput: 104.84MB - max

Max Read Latency Spikes: 82ms - max - avg around 2.83ms (24h period)

Max Write Latency Spikes: 153ms - max - avg around 6.83ms (24h period)

As you can see for a simple change it was a huge boost, most notably the latency avg has declined drastically and when demand hits the spikes hit lower then before and are only occuring for a few seconds before they go back down to normal where as before those spikes would sometimes last 5-30 mins depending on load.

Next I made the change found on this forum (benchmark posts continuing to improve listed below)

First the steps I made in Order:

1. Downloaded HP storage driver found here:

https://h20392.www2.hp.com/portal/swdepot/try.do?productNumber=StoreVirtualSW&lang=en&cc=us&hpappid=114372_PDAPI_SWD_PRO_HPE

2. I run esxi 5.5 so I downloaded that version.

3. Moved the file to the /tmp directory on my secondary host (keep in mind I have all my production VM's on my primary host right now without this driver so I can fail back over in case instability occurs, I wont trust it until I beat it up testing all week to be sure).

4. Installed the driver using this for instructions:

http://h10032.www1.hp.com/ctg/Manual/c04586354

Starting on Page 15, do not confuse HP DSM driver with HP Storevirtual Multipathing Driver.

Also note its called storevirtual now, rebranding, but this works on lefthands running the newest version of LEFTHAND OS 12.5 which is what I'm currently running.

5. Here is where the fun began, after rebooting the (1) host I was testing this multipath driver on it could indeed see all paths but would not mount any datastores.

6. First I fixed the path policy for each datastore in esxi client on each host through ssh:

Here is a great article on how to mass change if you have multiple datastores:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2053628

Keep in mind to edit the string they give you accordingly which will be different for every storage system.

In the string they have you run, this part: VMW_PSP_RR needs to be changed to the policy you want to use.

WHen you run this command: esxcli storage nmp satp list you will get a list of plugins you have. You should see "HP_PSP_LH" in the second column

So the above command I modfied as such for my situation (its different for yours so be diligent and find the settings you need to change for yorus:

Command from VB article:

for i in `esxcli storage nmp device list | grep '^naa.xxxx'` ; do esxcli storage nmp device set --device $i --psp VMW_PSP_RR; done

My Command:

for i in `esxcli storage nmp device list | grep '^naa.600'` ; do esxcli storage nmp device set --device $i --psp HP_PSP_LH; done

The above changes I made to be specific were "naa.600" because my storage paths start with naa.600 and I wanted all of them modified to use the new lefthand pathing policy. Also I changed "VMW_PSP_RR" which is default vmware round robin to "HP_PSP_LH"

Use commands to list all of your paths to see the policy updated. Note the IO command I first posted above as my performance increase cannot be ran on Policies not using RoundRobin or this HP_PSP_LH. If you are using single pathing the IO setting above will make no difference to you because its how much IO it lets build up before using a secondary path. I'm already going into enough detail so read everything above to see if it all applies to your situation.

Now moving on...

7. Rebooted this host again and it saw all paths, but OH NO it still wont mount datastores.

THis is where the previous poster to me saved me so credit him with that, disable VAAI heartbeat. For HA clusters it basically how the servers are able to share storage. In ESXI 5.5 U2 Vmware made a change to how this was done which I guess lefthand multipath driver hates.

Here is a vmware article explaining it, Plus the fix to run on each host you want to run this HP driver on. (Edit: I ran this command on both hosts, as HA would not re-enable correctly until done on both to change how heartbeating is detected) I had to do this to actually migrate a vm to the host with Lefthand multipath driver so I could actually test)

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2113956

It tells you how to enable /disable it as needed for your testing in case you need to failback. I would only want anyone using this to test an empty host to get datastores up and running and mounted, and only then migrate non-critical VM's back to this host to test performance increases / stability. Note:heart beat change must be done in cluster on all HA enabled hosts if you want to migrate things to test the lefthand multipath driver on a single host.

Keep in mind all these changes I made I did with 0 downtime and interuption, plan it right and you can do the same, but it does require having at least (2) esxi hosts. If you only have one, there will be downtime. Spin down all VM's first before trying.

8. Now this is where I put it into maintennce mode after this last change I made on my test host only. Then I took it out and HA for my vmware cluster came back online and every datastore mounted. I myself did not need to rescan storage adapters. YMMV.

9. Then I moved a production VM over that was non-critical and started benchmarking.

Here are my final results.

After IO Change from vmware article above:

Datastore Read IO: 21,623 - max

Datastore Write IO: 11,031 - max

Datastore Read Throughput: 174.35MB - max

Datastore Write Throughput: 104.84MB - max

Max Read Latency Spikes: 82ms - max - avg around 2.83ms (24h period)

Max Write Latency Spikes: 153ms - max - avg around 6.83ms (24h period)

After turning on HP Multipathing Driver and disabling VAAI heartbeat

Datastore Read IO: 28,256- max

Datastore Write IO: 17999 - max

Datastore Read Throughput: 221MB - max

Datastore Write Throughput: 141MB- max

Max Read Latency Spikes: 76ms - max - avg around 9.78ms (2h period)

Max Write Latency Spikes: 85 - max - avg around 19.21ms (2h period)

Latency varied a bit but remained around the same. I'm thinking this was because it was during migrating VM that was on to it and the heavy amount of benchmarking performed to hit these max numbers. I might come back and post my 24h period of records with this change later but I think its seems stable and much faster for now.

I suspect the latency spike though might be because my benchmarking hit the maximum that this hardware can handle. It was very close to HP best case posted number for (4) nodes using their best practices.

I hope this helps some of you out, feel free to drop a line if you have more questions for me on this or feel free to share your results if this helped you out!.

I also would like to know of anyone with the same setup that got better results than me to possibly help tweak my settings to go even faster.

/end wall of text.

david11 · ‎09-15-2015

I agree with original poster, any unclean shutdown caused on a host will result in it coming back online without mounting all of the datastores.

Our fix was to put unclean shutdown esxi 5.5 host into maintenance mode and then reboot it again, then exit maintenance mode, problem solved. We repeated this three times.

We produced the effect by simply pulling power cables out of the server to simulate DR scenario. Dont worry it was a test server with nothing running on it other then esxi and a dummy VM for testing purposes. It failed over fine with vmware HA enabled to a host without the lefthand multipathing driver.

I'm ok for being contacted about this by HP to help provide a fix, its a bug I believe as with default vmware multipathing round robin this is a non issue.

However for now I leave lefthand multipathing on for the performance boost I saw posted above in my massive wall of text! :)

I am curious if this happens to people with same setup but on esxi 6.0.

Feel free to ask me questions about my posts, I will follow up daily on this thread for a few weeks with more results I find.

david11 · ‎09-15-2015

Just wanted to give a final reply thanking:

rossiauj - for original post, you are right sir! I have same repeated results with same problem you describe.
balblas - for posting about disabling VAAI heartbeat fix he was recomended by support.

I gave you both a +1 - what you guys did by posting saved me alot of time the past few days implementing your finds with this driver so I could get increased performance.

Thanks!

-David

david11 · ‎10-12-2015

Just FYI after updating to esxi 5.5 u3, this driver was running ok for about two weeks and then it became horrible unstable. I had to revert back to default VMware pathing to get our business back online.

Now that we have lost the throughput and very low latency the lefthand storage is running horrible at the moment. Anyone have any suggetions or tweaks to improve without the use of lefthand mem driver? Main issue honestly is latency, with lefthand driver installed everything stayed under 5ms. Now i have daily spikes above 1k randomly for a few minutes here and there.

mnewpair · ‎10-13-2015

Hello David,

for now we're running a two side VMware (3Cluster)

Currently we set PSP LH active. As we repeatedly experienced problems with disappearing datastores i wanted to try to deactivate VAAI ATS Heartbeat.

You are all speaking about VMware 5.5 U3 we're currently using Vmware 6.0 .

Is it possible that Store Virtual MEM wont work stable on this version for now ?

I see we got 2 possibilities :

Try turning off VAAI ATS Heartbeat

or

Return to round robin and set the IOPS Limit to 1

what would you recommend me ?

P.S. : Sorry for my bad english.

Regards mnewpair

david11 · ‎10-16-2015

I would stick with round robin and IPOS set = 1.

The HP driver is very unstable at the moment and I cannot get it to work right myself.

I would wait until the next revision of HP MEM driver or for HP to provide some sort of update as to why it is unstable.

rossiauj · ‎10-16-2015

After running a few months stable with the HP LH MEM driver on two of the ten ESXi 5.5 hosts, even without disabling VAAI ATS, I noticed some VM's had problems today. When I checked the Host it had some missing VMFS datastores (again, the LUN was visible, just the VMFS datastore did not get mounted). There was no power or network outage that could have caused it, so why it decided to dump the datastore, I don't know.

Luckily, I was able to either shutdown or vmotion all the production VM's of this host and then removed the LH MEM driver from the two ESXi hosts that still used it.

I will advice anyone not to use the HP LH MEM for production purposes until this gets fixed, if ever it gets fixed (won't hold my breath for it).

We have enough spindles to get the required IOPs and latency is pretty low, so I can live without this driver for now.

Regards,

Jos

david11 · ‎10-19-2015

Thanks for letting me know that, it looks like this driver is very unstable for the moment. I'm looking at vmware compatibility matrix and it listed the mem driver as ok for use under esxi 6.0 with lefthand os 12.0 I wonder if my problem is that I am running lefthand os 12.5 and it is simply unstable with this version of lefthand os.

I also notice a new software version for 12.5 came out today or sometime recently which I am updating to now, debating trying to put it back on with vaai disabled just to get out of my latency issue.

I am running on 4 procurve 2810's which I thought would be good enough and it seems they are when using lefthand mem driver to split the load evenly across multiple nic ports. However after removing lefthand mem driver and back to default vmware it seems the traffic is to much for single port on these switches due to the fact that well they are not perfectly suited for iscsi traffic with such low port buffer, I think its like only 750k shared for the entire switch.

However I am only a 2 host, 26 vm shop with very little load on each server, things are just split of for seperated services, each server 1 task type of deal.

I am looking at just replacing the switches with cisco 4900m switches simply because I was able to get a great price on some refurbished ones with 24 ports of 10gb. Can anyone chime in here that has used lefthand p4500 g2 with procurve 2810 to confirm these switches are probably my main problem with bad latency issues and constant congestion. I can confirm shutting down half my VM's seems to relieve most issues and I never see the 1gb ports at full throughput so I'm thinking the issue is simply packet throughput and dropped traffic due to low buffers during burst iscsi traffic.

If anyone else finds a stable way to run the lefthand mem driver please respond back with which version of the hp mem driver you are using and whether or not its paired with hp's newest customized esxi images of 5.5 u3 as its what I'm using.

david11 · ‎10-26-2015

So I had VMware support digging through logs when I had my horrible outage caused by this driver instability. This kind of confirms the driver issue for me without a doubt.

Hopefully HP sees this and its helpful. Also for anyone experiencing hte same you can check your logs for the same types of messages because even if you think its stable now it can randomly just stop working, what finally causes it to do so is still unknown to me.

I know its alot below but I cut alot out to shorten it because these errors repeated for millions of lines because I have multiple VM's which I'm sure most of you do. VMWare recomendationa nd KB article explaining there findings is at the bottom. They confirmed I had all the newest drivers for my platform and recomended I contact the storage provider HP to find out why there driver is randomly reporting APD (all paths down) which is what causes the inaccessable message while still showing the path is up and OK. Hope this helps in HP finding a resolution to the mem driver so they can fix it and give us all the performance we want with stability. Good luck all!

from vmware support:

Hello ,

Greetings!!

I have analyzed the logs,Please find below the log snippet:

ESX build

==========

VMware ESXi 5.5.0 build-3029944

VMware ESXi 5.5.0 Update 3

Host Hardware

===============

ProLiant DL360p Gen8

Hostname

================

vnm00002.amer.dmai.net

VOBD.log

========

15-10-09T15:40:01.004Z: Failed to send event (esx.audit.net.firewall.config.changed); 2 failures so far.

2015-10-09T15:40:01.004Z: Failed to send event (esx.audit.net.firewall.config.changed); 2 failures so far.

2015-10-09T15:40:01.004Z: Failed to send event (esx.audit.net.firewall.port.hooked); 2 failures so far.

2015-10-09T15:40:01.004Z: Failed to send event (esx.problem.storage.iscsi.target.connect.error); 2 failures so far.

2015-10-09T15:40:01.005Z: Failed to send event (esx.problem.storage.iscsi.target.connect.error); 2 failures so far.

2015-10-09T15:40:01.005Z: Failed to send event (esx.clear.coredump.configured2); 2 failures so far.

2015-10-09T15:40:01.005Z: Failed to send event (esx.problem.scratch.partition.unconfigured); 2 failures so far.

2015-10-09T15:40:01.005Z: Failed to send event (esx.audit.net.firewall.config.changed); 2 failures so far.

2015-10-09T15:40:01.005Z: Failed to send event (esx.audit.dcui.enabled); 2 failures so far.

2015-10-09T15:40:01.005Z: Failed to send event (esx.audit.ssh.enabled); 2 failures so far.

2015-10-09T15:40:02.498Z: [iscsiCorrelator] 202292139us: [vob.iscsi.target.connect.error] vmhba34 @ vmk2 failed to login to iqn.2003-10.com.lefthandnetworks:vmware-generic:328:vnmwhatsup because of a network connection failure.

2015-10-09T15:40:02.498Z: [iscsiCorrelator] 202292585us: [esx.problem.storage.iscsi.target.connect.error] Login to iSCSI target iqn.2003-10.com.lefthandnetworks:vmware-generic:328:vnmwhatsup on vmhba34 @ vmk2 failed. The iSCSI initiator could not establish a network connection to the target.

2015-10-09T15:40:02.498Z: An event (esx.problem.storage.iscsi.target.connect.error) could not be sent immediately to hostd; queueing for retry.

2015-10-09T15:40:02.499Z: [iscsiCorrelator] 202293340us: [vob.iscsi.target.connect.error] vmhba34 @ vmk3 failed to login to iqn.2003-10.com.lefthandnetworks:vmware-generic:328:vnmwhatsup because of a network connection failure.

2015-10-09T15:40:02.499Z: [iscsiCorrelator] 202293667us: [esx.problem.storage.iscsi.target.connect.error] Login to iSCSI target iqn.2003-10.com.lefthandnetworks:vmware-generic:328:vnmwhatsup on vmhba34 @ vmk3 failed. The iSCSI initiator could not establish a network connection to the target.

2015-10-09T15:40:02.500Z: An event (esx.problem.storage.iscsi.target.connect.error) could not be sent immediately to hostd; queueing for retry.

2015-10-09T15:40:20.156Z: [netCorrelator] 219950263us: [vob.net.firewall.config.changed] Firewall configuration has changed. Operation 'enable' for rule set vpxHeartbeats succeeded.

2015-10-09T15:40:20.157Z: [netCorrelator] 219950834us: [esx.audit.net.firewall.config.changed] Firewall configuration has changed. Operation 'enable' for rule set vpxHeartbeats succeeded.

2015-10-09T15:40:20.157Z: An event (esx.audit.net.firewall.config.changed) could not be sent immediately to hostd; queueing for retry.

2015-10-09T15:40:22.931Z: [netCorrelator] 222725294us: [vob.net.firewall.config.changed] Firewall configuration has changed. Operation 'enable' for rule set CIMHttpServer succeeded.

2015-10-09T15:40:22.932Z: [netCorrelator] 222725833us: [esx.audit.net.firewall.config.changed] Firewall configuration has changed. Operation 'enable' for rule set CIMHttpServer succeeded.

2015-10-09T15:40:23.442Z: [netCorrelator] 223235987us: [vob.net.firewall.config.changed] Firewall configuration has changed. Operation 'enable' for rule set CIMHttpsServer succeeded.

2015-10-09T15:40:23.442Z: [netCorrelator] 223236349us: [esx.audit.net.firewall.config.changed] Firewall configuration has changed. Operation 'enable' for rule set CIMHttpsServer succeeded.

2015-10-09T15:40:36.194Z: [GenericCorrelator] 235988300us: [vob.user.host.boot] Host has booted.

2015-10-09T15:40:36.194Z: [UserLevelCorrelator] 235988300us: [vob.user.host.boot] Host has booted.

2015-10-09T15:40:36.195Z: [UserLevelCorrelator] 235988750us: [esx.audit.host.boot] Host has booted.

2015-10-09T15:40:36.352Z: [GenericCorrelator] 236146246us: [vob.user.coredump.configured2] At least one coredump target is enabled.

vmkernel.log

=============

2015-10-09T15:38:59.948Z cpu6:33374)VAAI_FILTER: VaaiFilterClaimDevice:270: Attached vaai filter (vaaip:VMW_VAAIP_LHN) to logical device 'naa.6000eb359ec2cd670000000000000209'

2015-10-09T15:38:59.968Z cpu6:33374)FSS: 5099: No FS driver claimed device 'naa.6000eb359ec2cd670000000000000209:1': Not supported

2015-10-09T15:38:59.968Z cpu6:33374)ScsiDevice: 3445: Successfully registered device "naa.6000eb359ec2cd670000000000000209" from plugin "NMP" of type 0

2015-10-09T15:38:59.970Z cpu6:33374)WARNING: HP_SATP_LH: satp_lhn_updatePath:424: In satp_lhn_updatePath setting path state to OK. vmhba34:C0:T7:L0

2015-10-09T15:38:59.970Z cpu6:33374)WARNING: HP_SATP_LH: satp_lhn_updatePath:508: In satp_lhn_updatePath not calling psp_LHPathBack - first time path is being set!

2015-10-09T15:38:59.971Z cpu6:33374)WARNING: HP_SATP_LH: satp_lhn_updatePath:424: In satp_lhn_updatePath setting path state to OK. vmhba34:C1:T7:L0

2015-10-09T15:38:59.971Z cpu6:33374)WARNING: HP_SATP_LH: satp_lhn_updatePath:508: In satp_lhn_updatePath not calling psp_LHPathBack - first time path is being set!

2015-10-09T15:38:59.971Z cpu6:33374)StorageApdHandler: 698: APD Handle Created with lock[StorageApd0x41093e]

2015-10-09T15:38:59.971Z cpu6:33374)ScsiEvents: 501: Event Subsystem: Device Events, Created!

2015-10-09T15:38:59.971Z cpu6:33374)VMWARE SCSI Id: Id for vmhba34:C0:T7:L0

0x60 0x00 0xeb 0x35 0x9e 0xc2 0xcd 0x67 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x24 0x69 0x53 0x43 0x53 0x49 0x44

2015-10-09T15:38:59.972Z cpu6:33374)VMWARE SCSI Id: Id for vmhba34:C1:T7:L0

0x60 0x00 0xeb 0x35 0x9e 0xc2 0xcd 0x67 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x24 0x69 0x53 0x43 0x53 0x49 0x44

2015-10-09T15:38:59.972Z cpu6:33374)ScsiDeviceIO: 7493: Get VPD 86 Inquiry for device "naa.6000eb359ec2cd670000000000000124" from Plugin "NMP" failed. Not supported

2015-10-09T15:38:59.972Z cpu6:33374)WARNING: HP_SATP_LH: satp_lhn_getBoolAttr:879: In satp_lhn_getBoolAttr.

2015-10-09T15:38:59.972Z cpu6:33374)WARNING: HP_SATP_LH: satp_lhn_isManagement:843: In satp_lhn_isManagement returning FALSE.

2015-10-09T15:38:59.972Z cpu6:33374)WARNING: HP_SATP_LH: satp_lhn_getBoolAttr:879: In satp_lhn_getBoolAttr.

2015-10-09T15:38:59.972Z cpu4:33106)WARNING: HP_SATP_LH: satp_lhn_pathFailure:985: In satp_lhn_pathFailure status = 5 sense key = 24 and sense code = 0. path vmhba34:C0:T7:L0

2015-10-09T15:38:59.972Z cpu4:33106)WARNING: HP_SATP_LH: satp_lhn_pathFailure:986: path=vmhba34:C0:T7:L0 cmd[0]=12 cmdid=465

2015-10-09T15:38:59.972Z cpu4:33106)WARNING: HP_SATP_LH: satp_lhn_pathFailure:1132: In satp_lhn_pathFailure unknown failure.

2015-10-09T15:38:59.972Z cpu6:33374)WARNING: HP_SATP_LH: satp_lhn_getBoolAttr:879: In satp_lhn_getBoolAttr.

2015-10-09T15:38:59.972Z cpu6:33374)ScsiDeviceIO: 6213: QErr is correctly set to 0x0 for device naa.6000eb359ec2cd670000000000000124.

2015-10-09T15:38:59.972Z cpu6:33374)ScsiDeviceIO: 6724: Sitpua was correctly set to 1 for device naa.6000eb359ec2cd670000000000000124.

2015-10-09T15:38:59.973Z cpu6:33374)VAAI_FILTER: VaaiFilterClaimDevice:270: Attached vaai filter (vaaip:VMW_VAAIP_LHN) to logical device 'naa.6000eb359ec2cd670000000000000124'

2015-10-09T15:38:59.992Z cpu6:33374)FSS: 5099: No FS driver claimed device 'naa.6000eb359ec2cd670000000000000124:1': Not supported

2015-10-09T15:38:59.992Z cpu6:33374)ScsiDevice: 3445: Successfully registered device "naa.6000eb359ec2cd670000000000000124" from plugin "NMP" of type 0

2015-10-09T15:38:59.993Z cpu6:33374)WARNING: HP_SATP_LH: satp_lhn_updatePath:424: In satp_lhn_updatePath setting path state to OK. vmhba34:C0:T2:L0

2015-10-09T15:38:59.993Z cpu6:33374)WARNING: HP_SATP_LH: satp_lhn_updatePath:508: In satp_lhn_updatePath not calling psp_LHPathBack - first time path is being set!

2015-10-09T15:38:59.994Z cpu6:33374)WARNING: HP_SATP_LH: satp_lhn_updatePath:424: In satp_lhn_updatePath setting path state to OK. vmhba34:C1:T2:L0

2015-10-09T15:38:59.994Z cpu6:33374)WARNING: HP_SATP_LH: satp_lhn_updatePath:508: In satp_lhn_updatePath not calling psp_LHPathBack - first time path is being set!

2015-10-09T15:38:59.994Z cpu6:33374)StorageApdHandler: 698: APD Handle Created with lock[StorageApd0x41093e]

2015-10-09T15:38:59.994Z cpu6:33374)ScsiEvents: 501: Event Subsystem: Device Events, Created!

2015-10-09T15:38:59.994Z cpu6:33374)VMWARE SCSI Id: Id for vmhba34:C0:T2:L0

0x60 0x00 0xeb 0x35 0x9e 0xc2 0xcd 0x67 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xed 0x69 0x53 0x43 0x53 0x49 0x44

2015-10-09T15:38:59.995Z cpu6:33374)VMWARE SCSI Id: Id for vmhba34:C1:T2:L0

0x60 0x00 0xeb 0x35 0x9e 0xc2 0xcd 0x67 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xed 0x69 0x53 0x43 0x53 0x49 0x44

2015-10-09T15:38:59.995Z cpu6:33374)ScsiDeviceIO: 7493: Get VPD 86 Inquiry for device "naa.6000eb359ec2cd6700000000000000ed" from Plugin "NMP" failed. Not supported

2015-10-09T15:38:59.995Z cpu6:33374)WARNING: HP_SATP_LH: satp_lhn_getBoolAttr:879: In satp_lhn_getBoolAttr.

2015-10-09T15:38:59.995Z cpu6:33374)WARNING: HP_SATP_LH: satp_lhn_isManagement:843: In satp_lhn_isManagement returning FALSE.

2015-10-09T15:38:59.995Z cpu6:33374)WARNING: HP_SATP_LH: satp_lhn_getBoolAttr:879: In satp_lhn_getBoolAttr.

2015-10-09T15:38:59.995Z cpu4:33106)WARNING: HP_SATP_LH: satp_lhn_pathFailure:985: In satp_lhn_pathFailure status = 5 sense key = 24 and sense code = 0. path vmhba34:C0:T2:L0

2015-10-09T15:38:59.995Z cpu4:33106)WARNING: HP_SATP_LH: satp_lhn_pathFailure:986: path=vmhba34:C0:T2:L0 cmd[0]=12 cmdid=482

2015-10-09T15:38:59.995Z cpu4:33106)WARNING: HP_SATP_LH: satp_lhn_pathFailure:1132: In satp_lhn_pathFailure unknown failure.

2015-10-09T15:38:59.995Z cpu6:33374)WARNING: HP_SATP_LH: satp_lhn_getBoolAttr:879: In satp_lhn_getBoolAttr.

2015-10-09T15:38:59.995Z cpu6:33374)ScsiDeviceIO: 6213: QErr is correctly set to 0x0 for device naa.6000eb359ec2cd6700000000000000ed.

2015-10-09T15:38:59.995Z cpu6:33374)ScsiDeviceIO: 6724: Sitpua was correctly set to 1 for device naa.6000eb359ec2cd6700000000000000ed.

2015-10-09T15:38:59.996Z cpu6:33374)VAAI_FILTER: VaaiFilterClaimDevice:270: Attached vaai filter (vaaip:VMW_VAAIP_LHN) to logical device 'naa.6000eb359ec2cd6700000000000000ed'

2015-10-09T15:39:00.018Z cpu6:33374)FSS: 5099: No FS driver claimed device 'naa.6000eb359ec2cd6700000000000000ed:1': Not supported

Analysis:

=========

We have checked and found that there was a network connection failure and APD issue reported during that time stamp.

We have verified the drivers and it's upto date.

Recommendation:

===============

Please contact your storage vendor to find out the cause for APD.

Reference KB article:

http://kb.vmware.com/kb/2004684

Please let me know if you have any clarifications.

david11 · ‎10-26-2015

Also please note I have seperate phsyical network jsut for iscsi traffic and it was working fine for hosts which had generic vmware driver, just the hosts with the hp lefthand mem driver had these issues.

Vmware even states in their KB article when this happens there is no clean way to reset it as the host will lock up trying to reconnect forever, this is why when you force it down or are able to reboot it through console, it will drag for like 30 minutes coming back up. I recomend the fastest recovery to be SSH session to your host, unisntall the lefthand mem driver if possible first before forcing it down so when it boots it uses the vmware generic driver. It will still be a long boot while esxi host clears all the errors it experienced.

YMMV. Hope this is helpful.

miki777 · ‎11-05-2015

Yes, this driver is a total disaster, I've had a total crash of all servers and all virtual machines ( 30 + vm's ), very stressfull event indeed. I'm suprised that HP still didn't solve this problem so far, as may people are having obvious problems with it, but it seems that they are the only ones that are not having this kind of problems with it :D

slymsoft · ‎11-17-2015

My advice is the same as most of the previous ones : DO NOT USE HP MEM IN PRODUCTION OR YOU WILL REGRET IT !

I installed the latest MEM module shipping with LH 12.5 (HP_StoreVirtual_Multipathing_Extension_Module_for_Vmware_vSphere_5.1_AT004-10523.vib) on 4 ESXi 5.1, Storage cluster was an 8 node P4730 with LeftHand OS 12.0. I did not disable VAAI ATS (as it was not mentionned anywhere in HP's documentation).

The next day I installed the MEM module I went through an upgrade from LH 12.0 to 12.5 => after a few node reboot to apply patches the 4 ESXi servers became unresponsive. Everything was just fine in the CMC, volumes were always online.

It was extremely unstable. Some ESXi were hanging then working for a few minutes then hanging again, one did a PSOD and another one was so unstable I could not use 90% of cli commands on it not even a /sbin/services.sh restart or an esxtop :-/

I checked the VMkernel logs of the 4 ESXi and there was a sh*t load of message "satp_lhn_pathfailure".

It took us a day to go back to normal. We had to uninstall the MEM module and go back to the good old VMware Round Robin @ 1 IOPS (working great for years !).

HP Team : Please remove the MEM module from official downloads until it is stable

Sorry for the extended use of red + big font size but I think this issue deserves it and every person reading this post should be scared to use this piece of software. This is exactly how you lose the trust of your clients / partners.

Categories

Company

Local Language

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Discussions

Forums

Forums

Discussions

Forums

Discussions

Forums

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Community

Resources

Other HPE Sites

Discussions

Forums

Blogs

Re: StoreVirtual Multipathing Extension Module for vSphere 5.5 missing VMFS datastores

StoreVirtual Multipathing Extension Module for vSphere 5.5 missing VMFS datastores