StoreVirtual Storage

StoreVirtual Multipathing Extension Module for vSphere 5.5 missing VMFS datastores

 
balblas
Advisor

Re: StoreVirtual Multipathing Extension Module for vSphere 5.5 missing VMFS datastores

Our issue was resolved by disabling the VAAI ATS heartbeat, as suggested by HP support, following http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2113956.

 

Once VAAI ATS heartbeats were disabled on our VMware hosts, a lost connection to a LeftHand volume no longer caused issues as long as the host was in maintenance mode. Once the connection was restored, a storage rescan cleared all errors on the host again.

david11
Advisor

Re: StoreVirtual Multipathing Extension Module for vSphere 5.5 missing VMFS datastores

I'm confused by the last message. I did disable the VAAI heartbeat but am still experiencing the same issues. The way you wrote it, it sounded like it only resolved your problems while the host was in maintenance mode?

 

I have been fighting with putting one host on this driver in hopes of gaining better throughput, but it seems very flaky thus far.

 

Could you please be more specific about the order of steps you took to get the datastores to mount correctly after disabling the VAAI heartbeat?

 

Thanks,

David

david11
Advisor

Re: StoreVirtual Multipathing Extension Module for vSphere 5.5 missing VMFS datastores

I think I figured it out: I had to re-enter maintenance mode after disabling the VAAI heartbeat and then exit maintenance mode, and that fixed the issue without me having to rescan.

 

I will follow up with a guide in a few days if this proves stable. First I will check the performance gains and monitor, and then crash the host repeatedly to test recovery and stability.

 

If all ends well, I will post a detailed guide with the articles I followed and the order that worked for me to get it stable.

 

Thanks,

David

david11
Advisor

Re: StoreVirtual Multipathing Extension Module for vSphere 5.5 missing VMFS datastores

The post below is intended to help people in my situation improve their performance.

 

If you are going to try my steps below for how I improved my performance, be sure to only do this on ESXi 5.5 hosts that have no VMs running; otherwise you will cause downtime and interruptions to your business. As most sites have multiple server hosts, you should be able to migrate everything off of one host, test on just that empty host with VMs that are not important, and then migrate back to it one by one.

 

Test, test, and test again to be sure it's stable before using these changes in production!  /disclaimer over!

 

Also be sure to read every article link I post in its entirety; they might not apply to your setup, but they did apply to mine and helped me out greatly. YMMV.

 

 

OK, here is my follow-up after the last 5 days of tweaking settings to get better performance on my LeftHand storage. I wanted to share it in hopes it helps others with similar setups.

 

My Equipment:

(2) DL360 G8 servers. Each server has (2) 1Gb links to our LAN for server access and (2) 1Gb links to a physically separate storage network.

 

I run these (2) servers in a VMware ESXi 5.5 U3 cluster, which attaches to my LeftHand Storage P4500 G2 nodes.

 

(4) P4500 G2 nodes in total. Each one has (12) 15K 600GB SAS drives, giving a total of (48) disks across all (4) nodes.

 

We are split across two buildings for redundancy.  Between these two buildings we have dedicated fiber we own.

 

So the equipment in each building breaks down as follows:

(1) DL360 G8 server

(2) P4500 G2 LeftHand nodes

(2) HP ProCurve 2800 series switches (1Gb ports), stacked locally in each building through (4) 1Gb Ethernet ports. Each switch then has (2) 1Gb fiber links going across to the other building, which has exactly the same equipment setup.

 

The LeftHand P4500 nodes are RAID 10 locally and Network RAID 10 across the buildings.

 

Details of our performance last week, as we were migrating more and more servers onto this storage:

 

Max IO was around 4-5K reads and 1-2K writes. Throughput was capping out at around 95 MB/s reads and 50 MB/s writes. Terrible, I know! :D  Also note that latency was bad while under load. See the improvement details below.

 

The above was with pretty much stock default settings, apart from the following changes made according to HP best practices I found in online documentation:

 

I network-bonded the (2) 1Gb ports on each server that go to the storage network. I also had all datastores mounted in VMware with the round robin policy. The P4500 G2 LeftHand nodes were set up with Adaptive Load Balancing on their built-in (2) 1Gb ports. The switch configs are basically blank, with the exception of trunking the ports between the local ProCurves' redundant links. I had tried enabling LACP, but it yielded no real difference in results, most likely because we are not licensed for vSphere Distributed Switches; that comes at an additional cost, so we don't have it.

 

I used SQLIO to benchmark performance and pulled statistics from Veeam ONE monitoring while running these benchmarks on a virtual server. I use this for my benchmark because most of our load comes from SQL, so it seemed like the best tool for the job. If you know of better ones, please share! I'm always learning.
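
If you have not used SQLIO before, a typical run looks something like the line below. The switches and test file path here are just an illustration from memory, not the exact parameters I used, so check the SQLIO readme before copying anything:

sqlio -kR -t4 -s120 -o8 -b8 -frandom -LS D:\sqliotest.dat

(-kR = read test, -t4 = 4 threads, -s120 = run for 120 seconds, -o8 = 8 outstanding IOs per thread, -b8 = 8 KB blocks, -frandom = random access, -LS = capture latency stats. Run a -kW pass as well for writes, and make sure the test file already exists and is large enough that you are not just hitting cache.)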

 

The first thing I found that was a simple and easy fix for my issues was in this article (change the IO operations limit to 1 for round robin):

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2069356
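
For reference, the change that KB describes is the round robin IO operations limit. A one-liner along these lines sets it to 1 on every matching device (I adapted the naa.600 prefix to match my volumes, just like the path-policy loop further down; double-check it against the KB for your environment):

for i in `esxcli storage nmp device list | grep '^naa.600'` ; do esxcli storage nmp psp roundrobin deviceconfig set --type=iops --iops=1 --device=$i; done

(The roundrobin deviceconfig command only applies to devices that are already using VMW_PSP_RR.)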

 

After making this change and re-running the test, I saw an amazing performance increase.

Original benchmark:

Datastore Read IO: 4,292 max

Datastore Write IO: 1,782 max

Datastore Read Throughput: 92 MB/s max

Datastore Write Throughput: 41 MB/s max

Max Read Latency Spikes: 118 ms max, avg around 50 ms (24-hour period)

Max Write Latency Spikes: 305 ms max, avg around 100 ms (24-hour period)

After the IO change from the VMware article above:

Datastore Read IO: 21,623 max

Datastore Write IO: 11,031 max

Datastore Read Throughput: 174.35 MB/s max

Datastore Write Throughput: 104.84 MB/s max

Max Read Latency Spikes: 82 ms max, avg around 2.83 ms (24-hour period)

Max Write Latency Spikes: 153 ms max, avg around 6.83 ms (24-hour period)

 

As you can see, for a simple change it was a huge boost. Most notably, the average latency has declined drastically, and when demand hits, the spikes are lower than before and only occur for a few seconds before going back down to normal, whereas before those spikes would sometimes last 5-30 minutes depending on load.

 

Next I made the change found on this forum (benchmarks showing the continued improvement are listed below).

First, the steps I took, in order:

1. Downloaded the HP storage driver found here:

https://h20392.www2.hp.com/portal/swdepot/try.do?productNumber=StoreVirtualSW&lang=en&cc=us&hpappid=114372_PDAPI_SWD_PRO_HPE

 

2. I run ESXi 5.5, so I downloaded that version.

 

3. Moved the file to the /tmp directory on my secondary host. (Keep in mind I have all my production VMs on my primary host right now without this driver, so I can fail back over in case instability occurs; I won't trust it until I have beaten it up with testing all week.)

 

4. Installed the driver using these instructions:

http://h10032.www1.hp.com/ctg/Manual/c04586354

Start on page 15, and do not confuse the HP DSM driver with the HP StoreVirtual Multipathing driver.

Also note it's called StoreVirtual now (rebranding), but this works on LeftHands running the newest version of LeftHand OS, 12.5, which is what I'm currently running.
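
The manual has the exact procedure, but the install itself boils down to something like the following (the bundle filename is only a placeholder for whatever file you downloaded in step 1; use the real name):

esxcli software vib install -d /tmp/<StoreVirtual-MEM-bundle>.zip

Afterwards, esxcli software vib list should show the new module, and a reboot is required, which is where step 5 picks up.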

 

5. Here is where the fun began: after rebooting the (1) host I was testing this multipath driver on, it could indeed see all paths but would not mount any datastores.

 

6. First I fixed the path policy for each datastore on each host via SSH:

Here is a great article on how to mass-change the policy if you have multiple datastores:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2053628

Keep in mind that you need to edit the string they give you accordingly; it will be different for every storage system.

In the string they have you run, the VMW_PSP_RR part needs to be changed to the policy you want to use.

When you run the command esxcli storage nmp satp list you will get a list of the plugins you have. You should see "HP_PSP_LH" in the second column.
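
You can also confirm the HP policy actually got registered after installing the driver. This is just a generic sanity check, not something from the HP manual:

esxcli storage nmp psp list

If HP_PSP_LH does not show up in that list, the MEM module most likely did not install correctly.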

 

So I modified the above command as follows for my situation (it will be different for yours, so be diligent and find the settings you need to change):

 

Command from the VMware KB article:

for i in `esxcli storage nmp device list | grep '^naa.xxxx'` ; do esxcli storage nmp device set --device $i --psp VMW_PSP_RR; done

 

My Command:

for i in `esxcli storage nmp device list | grep '^naa.600'` ; do esxcli storage nmp device set --device $i --psp HP_PSP_LH; done

 

To be specific, the changes I made were: "naa.600", because my storage paths start with naa.600 and I wanted all of them modified to use the new LeftHand pathing policy; and "VMW_PSP_RR" (the default VMware round robin), which I changed to "HP_PSP_LH".

 

Use the list commands to check all of your paths and confirm the policy was updated. Note that the IO change I posted above as my first performance increase cannot be applied to policies other than round robin or this HP_PSP_LH. If you are using single/fixed pathing, the IO setting above will make no difference to you, because it controls how many IOs are sent down one path before switching to another. I'm already going into enough detail, so read everything above to see if it all applies to your situation.
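
For the actual check, something like this prints the active policy for every device (or add --device with a specific naa ID to look at just one):

esxcli storage nmp device list | grep 'Path Selection Policy:'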

 

Now moving on...

 

7. Rebooted this host again and it saw all paths, but oh no, it still won't mount datastores.

This is where the previous poster saved me, so credit him with that: disable the VAAI heartbeat. For HA clusters it is basically how the servers are able to share storage. In ESXi 5.5 U2, VMware made a change to how this is done, which I guess the LeftHand multipath driver hates.

 

Here is a VMware article explaining it, plus the fix to run on each host you want to run this HP driver on. (Edit: I ran this command on both hosts, as HA would not re-enable correctly until it was done on both, since it changes how heartbeating is detected. I had to do this to actually migrate a VM to the host with the LeftHand multipath driver so I could test.)

 

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2113956
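
From memory, the commands in that KB boil down to the following; verify the exact option name and spelling against the article before running anything. To disable the ATS heartbeat:

esxcli system settings advanced set -i 0 -s /VMFS3/useATSForHBOnVMFS5

To re-enable it (for failing back), run the same command with -i 1 instead, and you can check the current value with:

esxcli system settings advanced list -o /VMFS3/useATSForHBOnVMFS5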

 

It tells you how to enable/disable it as needed for your testing, in case you need to fail back. I would only want anyone using this to test on an empty host to get the datastores up, running, and mounted, and only then migrate non-critical VMs back to that host to test performance increases and stability. Note: the heartbeat change must be done on all HA-enabled hosts in the cluster if you want to migrate things over to test the LeftHand multipath driver on a single host.

 

Keep in mind that I made all of these changes with zero downtime and interruption. Plan it right and you can do the same, but it does require having at least (2) ESXi hosts. If you only have one, there will be downtime; spin down all VMs first before trying.

 

8. This is where I put the test host into maintenance mode after that last change. Then I took it out of maintenance mode, HA for my VMware cluster came back online, and every datastore mounted. I myself did not need to rescan the storage adapters. YMMV.
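
If you prefer doing the maintenance mode dance from the shell instead of the vSphere client, these are the standard ESXi commands (nothing HP-specific, and only safe here because the host is empty at this point):

esxcli system maintenanceMode set --enable true

esxcli system maintenanceMode set --enable false

esxcli storage core adapter rescan --all

The rescan is only there in case your datastores still do not show up after exiting maintenance mode; I did not need it myself.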

 

9. Then I moved over a non-critical production VM and started benchmarking.

 

Here are my final results.

 

After the IO change from the VMware article above:

Datastore Read IO: 21,623 max

Datastore Write IO: 11,031 max

Datastore Read Throughput: 174.35 MB/s max

Datastore Write Throughput: 104.84 MB/s max

Max Read Latency Spikes: 82 ms max, avg around 2.83 ms (24-hour period)

Max Write Latency Spikes: 153 ms max, avg around 6.83 ms (24-hour period)

After turning on the HP multipathing driver and disabling the VAAI heartbeat:

Datastore Read IO: 28,256 max

Datastore Write IO: 17,999 max

Datastore Read Throughput: 221 MB/s max

Datastore Write Throughput: 141 MB/s max

Max Read Latency Spikes: 76 ms max, avg around 9.78 ms (2-hour period)

Max Write Latency Spikes: 85 ms max, avg around 19.21 ms (2-hour period)

 

Latency varied a bit but remained around the same. I think this was because it happened while I was migrating a VM onto the host and doing the heavy benchmarking needed to hit these max numbers. I might come back and post my 24-hour records with this change later, but for now it seems stable and much faster.

I suspect the latency spikes might be because my benchmarking hit the maximum that this hardware can handle. It was very close to HP's best-case published numbers for (4) nodes using their best practices.

 

I hope this helps some of you out. Feel free to drop a line if you have more questions for me on this, or share your results if this helped you out!

 

I would also like to hear from anyone with the same setup who got better results than me, to possibly help tweak my settings to go even faster.

 

/end wall of text.


david11
Advisor

Re: StoreVirtual Multipathing Extension Module for vSphere 5.5 missing VMFS datastores

I agree with the original poster: any unclean shutdown of a host will result in it coming back online without mounting all of the datastores.

 

Our fix was to put the unclean-shutdown ESXi 5.5 host into maintenance mode, reboot it again, then exit maintenance mode; problem solved. We repeated this three times.

 

We reproduced the effect by simply pulling the power cables out of the server to simulate a DR scenario. Don't worry, it was a test server with nothing running on it other than ESXi and a dummy VM for testing purposes. It failed over fine with VMware HA enabled to a host without the LeftHand multipathing driver.

 

I'm OK with being contacted about this by HP to help provide a fix. I believe it's a bug, as with the default VMware round robin multipathing this is a non-issue.

 

However, for now I am leaving LeftHand multipathing on for the performance boost I posted above in my massive wall of text! :)

 

I am curious whether this happens to people with the same setup but on ESXi 6.0.

 

Feel free to ask me questions about my posts; I will follow up on this thread daily for a few weeks with more results as I find them.

david11
Advisor

Re: StoreVirtual Multipathing Extension Module for vSphere 5.5 missing VMFS datastores

Just wanted to give a final reply thanking:

rossiauj - for the original post. You are right, sir! I have repeatedly seen the same results with the same problem you describe.
balblas - for posting about the VAAI heartbeat disable fix that support recommended to him.

 

I gave you both a +1. What you two did by posting saved me a lot of time over the past few days implementing your findings with this driver so I could get increased performance.

 

Thanks!

 

-David

david11
Advisor

Re: StoreVirtual Multipathing Extension Module for vSphere 5.5 missing VMFS datastores

Just an FYI: after updating to ESXi 5.5 U3, this driver ran OK for about two weeks and then became horribly unstable. I had to revert back to the default VMware pathing to get our business back online.
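
If you need to revert too, it is essentially the reverse of the guide above. Roughly (the vib name is a placeholder; check esxcli software vib list for the real one, and adjust the naa prefix as before):

for i in `esxcli storage nmp device list | grep '^naa.600'` ; do esxcli storage nmp device set --device $i --psp VMW_PSP_RR; done

esxcli software vib remove -n <hp-storevirtual-mem-vib-name>

Then reboot and re-apply the round robin IOPS=1 setting from my earlier post.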

 

Now that we have lost the throughput and very low latency, the LeftHand storage is running horribly at the moment. Does anyone have any suggestions or tweaks to improve things without the use of the LeftHand MEM driver? Honestly, the main issue is latency: with the LeftHand driver installed, everything stayed under 5 ms. Now I have daily spikes above 1,000 ms, randomly, for a few minutes here and there.

mnewpair
Occasional Visitor

Re: StoreVirtual Multipathing Extension Module for vSphere 5.5 missing VMFS datastores

Hello David,

 

For now we're running a two-site VMware environment (3 clusters).

 

Currently we have PSP LH set active. As we repeatedly experienced problems with disappearing datastores, I wanted to try deactivating the VAAI ATS heartbeat.

 

You are all speaking about VMware 5.5 U3; we're currently using VMware 6.0.

 

Is it possible that the StoreVirtual MEM won't work stably on this version for now?

 

I see we have 2 possibilities:

 

Try turning off VAAI ATS Heartbeat

 

or 

 

Return to round robin and set the IOPS Limit to 1

 

What would you recommend?

 

P.S.: Sorry for my bad English.

 

Regards mnewpair

david11
Advisor

Re: StoreVirtual Multipathing Extension Module for vSphere 5.5 missing VMFS datastores

I would stick with round robin and the IOPS limit set to 1.

 

The HP driver is very unstable at the moment and I cannot get it to work right myself.

 

I would wait until the next revision of the HP MEM driver, or for HP to provide some sort of update as to why it is unstable.

rossiauj
Occasional Advisor

Re: StoreVirtual Multipathing Extension Module for vSphere 5.5 missing VMFS datastores

After running stable for a few months with the HP LH MEM driver on two of our ten ESXi 5.5 hosts, even without disabling VAAI ATS, I noticed some VMs had problems today. When I checked the host, it had some missing VMFS datastores (again, the LUN was visible; the VMFS datastore just did not get mounted). There was no power or network outage that could have caused it, so why it decided to dump the datastore, I don't know.

 

Luckily, I was able to either shut down or vMotion all the production VMs off this host, and then I removed the LH MEM driver from the two ESXi hosts that still used it.

 

I would advise anyone not to use the HP LH MEM driver for production purposes until this gets fixed, if it ever gets fixed (I won't hold my breath).

 

We have enough spindles to get the required IOPS, and latency is pretty low, so I can live without this driver for now.

 

Regards,

 

Jos