Storevirtual 3200 Latency Issue

richa3312
Regular Visitor

Hi All,

I'm wondering if anyone can help or has experienced a similar problem. It's a long post, so please bear with me!

We have recently purchased a Storevirtual 3200 SFF 10GB unit.

It's currently configured with 7 x 10K SAS drives in 2 x 3-disk RAID 5 sets along with a spare. We have exported a single volume (Network RAID 0) to a single host, which uses the Microsoft iSCSI initiator with multipathing enabled (4 x 1Gb connections).

On this volume we are running a single VM using Hyper-V 2012 R2. The VM has nothing running on it; it doesn't even have its network connected.

There is no other workload currently on the unit.

With this setup we are experiencing high write latency to the storage.

The VM host reports 40-50ms latency on average to the exported volume. Intermittently the latency will drop to what we consider normal, i.e. 1-2ms, and will remain at this level for several hours before jumping back to the 40-50ms level. Within the VM the latency is slightly higher, and although it doesn't seem to cause any issues, it does seem sluggish and would probably implode if any load was placed on it.

When we look at IOPS on the datastore we see 1-2 IOPS regardless of latency.

We have also tried connecting from other hosts and the same issue persists.

At first we thought we had a networking issue; however, we've realised that the latency is present in the performance charts on the StoreVirtual itself, so this seems unlikely. Also, when copying large files to the unit, throughput is good and easily saturates 1Gb Ethernet.

We have also noticed some strange behaviour when failing over between storage controllers. If we fail over to either storage controller the latency disappears; when we fail back, the latency returns.

Oddly, if we leave the controller failed over for a long period of time, at some point the latency will return.

Out of interest we have run Microsoft's DiskSpd program to check the IOPS on the unit and compared this to the drives on the host server.

On the host server, which has 2 x 10K SAS SFF drives in RAID 1 with a 2GB FBWC (P440ar), we see very high IOPS and throughput. If we disable the FBWC using HP SSA, things look far more as we'd expect and vaguely in line with the performance suggested by a RAID calculator for two 10K disks in RAID 1.

On the same test the StoreVirtual doesn't behave as if it has a write cache at all, and performs in line with what a RAID calculator suggests for two RAID 5 arrays with a stripe.

Even without the cache I wouldn't expect to see this latency when there is no load on the unit; in fact, I wouldn't expect this on a single SATA drive!

Has anyone seen this before? Does anyone have a similar setup? Am I expecting too much? It just doesn't seem right to me.

I do have a case open with support, but it's slow and we keep going around in circles.


Thanks for looking!

PS I have graphs and screen shots which I will upload if I can work out how to!
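
For readers who want to reproduce the "RAID calculator" comparison mentioned above, the usual rule-of-thumb estimate can be sketched in a few lines of Python. The per-disk figure of 140 IOPS for a 10K SAS drive and the write penalties (2 for RAID 1, 4 for RAID 5) are common planning assumptions, not measurements from this unit, and the function name is made up for illustration.

```python
# Rule-of-thumb RAID IOPS estimate (illustrative only).
# ASSUMPTIONS: ~140 IOPS per 10K SAS disk and the classic write penalties
# below; neither value comes from the thread or from HPE documentation.
WRITE_PENALTY = {"raid0": 1, "raid1": 2, "raid5": 4, "raid6": 6}


def estimate_iops(disks, per_disk_iops, raid_level, write_fraction):
    """Estimate host-visible IOPS for an uncached array.

    write_fraction is the share of writes in the workload (0.0 to 1.0).
    Each host write costs `penalty` back-end IOs, each read costs one.
    """
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    raw_iops = disks * per_disk_iops
    penalty = WRITE_PENALTY[raid_level]
    return raw_iops / ((1.0 - write_fraction) + write_fraction * penalty)


if __name__ == "__main__":
    # Host server: 2 disks in RAID 1, write cache disabled, 100% writes.
    print(estimate_iops(2, 140, "raid1", 1.0))
    # Array: two 3-disk RAID 5 sets striped together (6 disks), 100% writes.
    print(estimate_iops(6, 140, "raid5", 1.0))
```

With these assumptions, two 10K disks in RAID 1 come out at roughly 140 write IOPS and six disks in RAID 5 at roughly 210, which is the order of magnitude an uncached array would be expected to show.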

40 REPLIES
HPE_Help
Occasional Visitor

Re: Storevirtual 3200 Latency Issue

Hello richa3312 - can you provide me with the support case ID so I can track down what is taking place on that side?

Also what type of drives are involved in your set up?

Thank you,

Karl

richa3312
Regular Visitor

Re: Storevirtual 3200 Latency Issue

Thanks Karl,

The case reference is 5317502427

I had another support session on Friday and they have now escalated it to 3rd line after running some Iometer tests and seeing high write latency. I'm expecting them to come back to me at some point tomorrow.

The unit has 7 x SFF 10K 1.8TB SAS drives in it.

Regards,

Re: Storevirtual 3200 Latency Issue

We also have a StoreVirtual 3200 unit suffering from write latency issues. Ours is running 10GbE iSCSI to 3 ESXi hosts. The unit is home to 4 SSDs, 21 10k SAS drives and 12 7.2k SAS drives - performance is awful on every tier.

I'd be really keen to know your resolution if and when you get one!

SVprodmgr
Advisor

Re: Storevirtual 3200 Latency Issue

Regarding support case 5317502427, this was resolved by reviewing how the SV3200 sends ACKs to the host. When the VM is doing no IO, there is no return IO for the host, so the ACK is only sent after a timeout. This is what caused the high latency in this case.

I'm an HPE employee working in product management
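
As a rough illustration of that explanation, here is a toy model in Python of how an ACK timeout could dominate the reported latency when a volume sees only 1-2 IOPS. The 40 ms timeout, the 1 ms service time and the idea that an ACK otherwise rides along with the next IO are all assumptions chosen to mirror the numbers in this thread; none of them is a documented SV3200 value.

```python
# Toy model: reported write latency when the ACK waits for return traffic.
# ASSUMPTIONS (not from HPE): IOs arrive evenly spaced, the ACK is carried by
# the next IO if one arrives before the timeout, and the timeout is 40 ms.


def observed_latency_ms(iops, service_ms=1.0, ack_timeout_ms=40.0):
    """Return the latency a host would record for one write in this model."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    gap_ms = 1000.0 / iops                    # time until the next IO arrives
    ack_wait_ms = min(gap_ms, ack_timeout_ms)  # ACK goes out at whichever is first
    return service_ms + ack_wait_ms


if __name__ == "__main__":
    for rate in (2, 100, 1000):
        print(rate, "IOPS ->", observed_latency_ms(rate), "ms")
```

In this model a near-idle volume at 2 IOPS shows about 41 ms per write while a busy one at 1000 IOPS shows about 2 ms, even though the disks are equally fast in both cases. It does not explain the hours-long drops to 1-2 ms reported at constant IO.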

Re: Storevirtual 3200 Latency Issue

Hi,

If I may butt in... We also happen to have a 3200/10G iSCSI (port-channeled LACP, hosting 4 SSDs and 71 SAS drives), attached to a pair of 5700 ProCurves and half a dozen DL360 Gen9 servers (running ESXi 6.0U3, and one on 6.5 for tests).
The intriguing thing is that we get ~20 MB/sec on a VMware host running the regular iSCSI software initiator; when doing iSCSI within a virtual machine on the same host, we hit ~200 to 300 MB/sec throughput.
We do get a lot of TASK_SET_FULL (0x28) within VMware and it seems the array or ESXi host is throttling for whatever reason, but we didn't succeed in finding out why.

What did you tune within the ACKs? We already tried delayed ACK and checked e.g. the iSCSI TOE settings, but to no avail.

Bart_Heungens
Honored Contributor

Re: Storevirtual 3200 Latency Issue

So what is the solution when seeing high latency? Host side or SV side? Will there be a software update? Or is it an OS tweak?

--------------------------------------------------------------------------------
If my post was useful, click on my KUDOS! "White Star"!
My blog: http://blog.bitcon.be

richa3312
Regular Visitor

Re: Storevirtual 3200 Latency Issue

Hi,

Yes, we never got a fix as such. The answer seems to be that when there is no IO the ACKs are delayed, which skews the latency we see on the storage. This is completely different to any SAS/FC MSA I've ever used, where no IO means 0ms response. We also still see periods where the IO doesn't change but the latency drops from 25ms to 1-2ms for a period of a few hours before jumping back up again.

See the pics below; it's certainly not what I would expect. How does this compare with your unit?

[Images: Latency, IO]

Performance wise testing comes back with the expected results for the number, type and RAID setup so once loaded it seems to perform as expected. 

The only thing that seems odd is that the 3rd line chap I spoke to said that the SV3200 won't cache random writes, only sequential writes, which I can't quite get my head around. I always thought that the cache was there to buffer writes and help even out performance. All the performance testing we have done with random writes seems to bear out that they are not cached by the device, i.e. it appears to operate in write-through mode. As mentioned in my original post, we compared it to a DL360 with the write cache enabled on the RAID card and the results are vastly different. It would be nice to fully understand how the cache works, as it seems different to other storage devices we have used in the past.

We're planning on monitoring it and seeing how it goes, as we've had to bring the unit into production. We are also planning on upgrading the unit with flash and additional spindles in the not too distant future, so it would be interesting to see if that makes any difference.

Re: Storevirtual 3200 Latency Issue

Hi, 

Regarding the comparison: we are not in production - it is a test unit (to be used for every kind of virtualized test system we have on premises). It is kind of difficult to pinpoint, but we got some impression from raw number dumping (Linux dd, 1MB blocks):
VMware server with LUNs (VMFS), VM on top -> 20 MB/sec
VMware server, VM on some storage, iSCSI from within VM -> 200 MB/sec
VMware server, VM on some storage, RDM pass-through -> 190 MB/sec

I'm not keen on debating whether it is 3 MB/sec more or less; it just seems VMware has some trouble with the SV, as other arrays work fine. Maybe VMFS has some kind of problem with iSCSI on the SV3200 (and not on LH4530, LH4730, NetApp, MSA)... or perhaps we missed something else...

regarding the basic array performance, when doing iometer from within a VM or physical machine, we get the following numbers: 
nRAID0 8k 100% READ 68MB/s and 8,300 IOPS @ ~10ms (4 workers, 20 outstanding I/Os)
nRAID10 8k 100% READ 90MB/s and 11,000 IOPS
nRAID0 2MB 100% READ max 190MB/s

The question as to WHY performance sucks with VMware/VMFS is somewhat unclear; the number of wait states and throttles in the VMware log also seems alarming.

/btw: sorry for hijacking your thread, but I feel some comparisons and talking it out could be helpful ^^
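
The Iometer figures quoted above can be sanity-checked with two simple relations: throughput is IOPS times block size, and average latency is outstanding I/Os divided by IOPS (Little's law). The sketch below assumes that "8k" means 8192 bytes, that MB means 10^6 bytes, and that the 20 outstanding I/Os apply per worker (80 in total); those readings are guesses about the test setup, not something stated in the post.

```python
# Cross-check of reported Iometer numbers (illustrative only).
# ASSUMPTIONS: 8k = 8192 bytes, 1 MB = 1_000_000 bytes, and 4 workers with
# 20 outstanding I/Os each, i.e. 80 I/Os in flight.


def throughput_mb_s(iops, block_bytes):
    """Throughput implied by an IOPS figure at a fixed block size."""
    return iops * block_bytes / 1_000_000


def littles_law_latency_ms(outstanding_ios, iops):
    """Average latency implied by Little's law: in flight = rate * latency."""
    return outstanding_ios / iops * 1000.0


if __name__ == "__main__":
    print(round(throughput_mb_s(8300, 8192), 1))         # nRAID0, 8k reads
    print(round(throughput_mb_s(11000, 8192), 1))        # nRAID10, 8k reads
    print(round(littles_law_latency_ms(4 * 20, 8300), 1))
```

Under these assumptions the results come out at about 68 MB/s, 90 MB/s and 9.6 ms, which agrees with the reported 68MB/s, 90MB/s and ~10ms. That suggests the ~10ms figure is mostly a consequence of the queue depth used rather than a separate symptom.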

SVprodmgr
Advisor

Re: Storevirtual 3200 Latency Issue

Thanks for the good data and observations.  If you feel that the latency or performance is not right on your SV3200, I would encourage you to contact HPE and open a case to get the experts to investigate.  There isn't enough information on this message thread to determine the root cause and it might be faster to let us look at each specific situation. 

Amy Mitchell

HPE StoreVirtual Product Manager

I'm an HPE employee working in product management

Re: Storevirtual 3200 Latency Issue

Be my guest - the initial performance call was 5315737219 and resulted in further testing and a firmware update 13.0x -> 13.1x; the current call chain 5319053973 started with upgrading to 13.5x, re-initializing the array and retrying to set it up in a proper, performing manner.

It may seem strange to put it that way, but it's not about feeling the performance is not right - in fact, it just is not. There is no logical explanation for a factor of 10 between native filesystems/iSCSI and VMFS; having virtual machine vMotion run at an average of 16 MB/sec is just plain sad. And - as I stated initially - the array itself would be fast enough; it's not like I was hoping for performance figures in the all-flash high-end arena...

As a side note, our array is called left and thus the controllers are (automatically) named left-SC1 and left-SC2. There is a bug in the initial assistant: when using the DNS names for the interface (e.g. https://left-sc1...), it will start initializing and renaming the controllers and then crash with absurd error messages (e.g. "HAL Error") while creating the RAID groups.

Re: Storevirtual 3200 Latency Issue

hi all,

I did a new set of performance tests with the new firmware (patch release at the weekend, version 135-010-00, 8 May 2017).
The numbers got better; just for comparison:

Initial StoreVirtual OS 13.5:
VMware server with LUNs (VMFS), VM on top -> 20 MB/sec
VMware server, VM on some storage, iSCSI from within VM -> 200 MB/sec
VMware server, VM on some storage, RDM pass-through -> 190 MB/sec

StoreVirtual OS 13.5.00.0791.0:
VMware server 6.0 with LUNs (VMFS), VM on top -> 35-55 MB/sec
VMware server 6.5 with LUNs (VMFS), VM on top -> 127 MB/sec
VMware server 6.0, VM on some storage, RDM pass-through -> 93.9 MB/sec

All those tests involve sequential writes to disk, no cache (just dd, 20 GB size) and should demonstrate bandwidth. I feel there is still some major issue with how the ESXi iSCSI stack and VMFS are handled, but finding out what exactly is pretty much out of my league. I'll keep you posted.
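
A dd-style sequential write test like the one described can be approximated in Python where dd is not available. This is a minimal sketch: the function name, the 1 MiB default block size and the use of zero-filled blocks are assumptions, and arrays that compress or thin-provision zeros may report optimistic numbers.

```python
# Minimal sequential-write bandwidth test, roughly comparable to a dd run.
# ASSUMPTIONS: hypothetical helper name, 1 MiB blocks, zero-filled data,
# and a single fsync at the end so the page cache does not hide the writes.
import os
import time


def sequential_write_mb_s(path, total_bytes, block_bytes=1024 * 1024):
    """Write total_bytes sequentially to path and return MB/s (10^6 bytes)."""
    block = b"\0" * block_bytes
    written = 0
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        while written < total_bytes:
            chunk = block[: min(block_bytes, total_bytes - written)]
            written += os.write(fd, chunk)  # os.write may write fewer bytes
        os.fsync(fd)  # force the data to stable storage before stopping the clock
    finally:
        os.close(fd)
    elapsed = time.perf_counter() - start
    return written / 1_000_000 / elapsed


if __name__ == "__main__":
    # Example: 256 MiB to a scratch file on the volume under test.
    print(sequential_write_mb_s("scratch.bin", 256 * 1024 * 1024))
```

For results comparable to the 20 GB runs in this thread, the file should be much larger than any cache in the path, and the scratch file should live on the datastore or LUN being measured.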

richa3312
Regular Visitor

Re: Storevirtual 3200 Latency Issue

Thanks for pointing out the update.

I installed it last night, so I will monitor it over the next few days and see if anything changes.

Thomas_Nielsen_
Occasional Visitor

Re: Storevirtual 3200 Latency Issue

Hi all,

We have exactly the same problem as described in this thread: very high latency (30-40ms) and low IOPS (500-1000), and this on a system with 21 SSDs and 44 10K disks - ESXi 6.0 U3 with 10 Gbit/s and 5900 series switches.

The performance in general is not great, but when measuring with reads only, the latency drops 10x and the IOPS go up 5x.
We have a similar system at another client running Hyper-V 2016 - here the performance is mind-blowing, with 40K IOPS at 0.2 ms latency - so the problem seems to be with the ESXi implementation.

The case has been given to the support team, and I will post any findings.

Re: Storevirtual 3200 Latency Issue

Hi,

another batch of tests coming through... ^^
I did some low-level Linux testing; the measurements are not "valid" since stuff happens way too fast, but it gives some idea:

SV3200@VMFS:              read ~42K IOPS / 0.003 msec latency
SV3200@VMFS:              write 195 IOPS / ~5 msec latency
SASraid on physical host: read 70K IOPS / 0.001 msec latency
SASraid on physical host: write 3,349 IOPS / 0.29 msec latency

This was on XFS filesystems (the virtual machine was set up using the same kickstart as the physical one); one may argue the caches etc. matter here, but the write IOPS on the SV3200 give away that something is pretty wrong here.

I also did some new tests with Windows/Iometer using RDM (raw device mappings, 4 pass-through disks); they ran as follows:

Windows raw device mapping, NTFS volumes:
nRAID10 4k 100% READ 74MB/s and 18,259 IOPS (average latency 1.8)
nRAID10 4k 100% WRITE 24MB/s and 5,385 IOPS (average latency 6)

Same Windows machine, having the boot disk on VMFS:
nRAID10 4k 100% READ 27MB/s and 6,509 IOPS (average latency 19)
nRAID10 4k 100% WRITE 8.5MB/s and 2,000 IOPS (average latency 24)
 
Cheers
Guillaume

Re: Storevirtual 3200 Latency Issue

My performance is particularly bad on ESX, but it seems that writes are the main factor. I also have a physical Windows server connected to the array, and like @GuillaumeRainer above I see my Windows iSCSI random SSD IOPS drop from 15,000 @ 2ms to 190  @ 169ms.

richa3312
Regular Visitor

Re: Storevirtual 3200 Latency Issue

I notice you have SSDs, SAS and NL SAS in your box. I didn't think the SV3200 supported three tiers, only two. Could this have something to do with it? My understanding is that the three drive types you have would all go in different tiers.

 

Re: Storevirtual 3200 Latency Issue

The initial config wizard puts my drives in 3 tiers, 0 (ssd), 1 (sas) and 2 (nl-sas/sata), and documentation backs this up.

I've done my performance testing with RAID only created on the tier I want to test, as there is no way to pin a volume to a certain tier (or if there is, I haven't found it!).

Thomas_Nielsen_
Occasional Visitor

Re: Storevirtual 3200 Latency Issue

Hi everyone,

While support is working on the case, this mail came from HPE today:

http://h20566.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-a00009136en_us&hprpt_id=HPGL_ALERTS_1967118&jumpid=em_alerts_us-us_May17_xbu_all_all_1017565_1967118_StorageOptions_critical__/

I have not tested the solution yet, but will do this asap.

Hope this helps some of you.

Best regards

Thomas

Re: Storevirtual 3200 Latency Issue

Hi all,

Thomas, thanks for sharing - I did a number of tests this morning (just before getting news from Support), with mixed feelings. Basically, since I am still running tests and unable to use this array in a somewhat productive way, I just have a single node attached to a given volume.
So the results of my usual benchmarks (e.g. dumping 20 GB straight in, reading 4k blocks from the dump) did not really change - maybe the figures would be different with VMware clustering.

Support, in contrast, urged me to try out the said ATS heartbeat fix; at the same time, they wanted another round of good old Windows raw device mapping Iometer tests (which I feel have been done more than enough).

Now, since HP is reading this anyway: the thing is, I do not care anymore whether Windows or BSD or Plan 9 gets some IOPS numbers when running InfiniBand over USB 1 or FireWire; that array has been put in place to provide storage for ESXi and it seems to be buggy in this exact combination.
No problem, HPE - fess up, build your own test rig, check it, fix it, or advise or show how it's done, and be done with it. But don't have me running for half a year with a 60k€ array doing benchmarks and console me with a "Customers need to do performance tuning with help of any performance consultant at their end, based on requirement".

Johannes_we
Advisor

Re: Storevirtual 3200 Latency Issue

I have that latency issue as well, but with a box that's running SSD and NL, and with a volume that is AO permitted, so all of it is kept in the SSD tier.

But I still don't get more than ~100MB/s out of it when testing a VMDK on an iSCSI VMFS...

Re: Storevirtual 3200 Latency Issue

Just to figure out some bits: a friend of mine (I worked as something like a presales consultant at HP some 12 years ago) brought up that the SV3200 initially had some problems doing 802.3ad bonding.

I was puzzled, because we went with 802.3ad precisely because it was dog slow on regular ALB bonding. Since the first 13.1 update, we have always had active-active bonding; currently, I'm trying to talk our network guys into disabling the port channel to do yet another performance test, this time on ALB.

So, are you using ALB or 802.3ad/LACP on your 10G interfaces?

Re: Storevirtual 3200 Latency Issue

I have tried ALB, active-passive and 802.3ad on 10GbE. Performance seems to be equally poor, regardless of what I choose.

Johannes_we
Advisor

Re: Storevirtual 3200 Latency Issue

I'm using ALB

Re: Storevirtual 3200 Latency Issue

Thanks, to me it also seemed as if the bond type doesn't really matter.

Any news on the calls?