MSA1040 extreme (1 to 40s) latency

 
blistovmhz
Occasional Visitor

A client has an MSA1040 with dual controllers and 20 discs, all 10k SAS.
I know very little about their infrastructure as I've only just started working with them, but I've consistently noticed very poor disc I/O performance in their VMs. There are 3x hypervisors, each with 2x dedicated 1Gb Ethernet for iSCSI (completely isolated from the rest of the network), going into two Aruba switches, and each SAN controller has 2x 1Gb Ethernet to each switch as well.

Under extremely low load, everything looks normal, with 2-4ms storage I/O response time at the hosts.
Maxing out the reads at around 150MB/s, the latency to both controllers is still sub-10ms.
If I do any sustained writes over 20MB/s, the I/O latency jumps to 1000ms, then 5000ms, then 15000ms, then as high as 40000ms.
Both read and write I/O are affected, but only while a write is occurring.

I couldn't find anything wrong with the VMware metal, nor the switches, and given it only occurs on writes, it can't be network.

I engaged HP support a week ago, and their suggestion, of course, was to update the controller and disc firmware. I asked for instructions/documentation and they assured me that the entire update could be performed safely online. For various reasons, they just didn't inspire a lot of confidence in me, and I wanted to speak to someone in lvl2 to discuss.
A week later lvl2 finally called me back, and after essentially telling me I'm stupid for an hour (politely), I finally got outta him that, according to HIS documentation, updating the disc firmware online would result in ... bad things. He assured me this had to be done offline (we've already scheduled a lights-on but low-I/O maintenance window, but we can't go dark during this period).

ANYHOW. I finally wised up and figured out I can look at some actual metrics and data if I ssh into the controller, and this is the first thing I see:
# show host-port-statistics
Durable ID           Bps                IOPS             Reads            Writes           Data Read        Data Written     Queue Depth      I/O Resp Time    Read Resp Time   Write Resp Time  Reset Time                 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
hostport_A1          1763.8KB           93               13279243780      14461709666      488.3TB          175.0TB          0                7843             19477            7331             2016-09-30 21:10:46        
hostport_A2          0B                 0                0                0                0B               0B               0                0                0                0                2016-09-30 21:10:46        
hostport_B1          0B                 0                0                0                0B               0B               0                0                0                0                2016-09-30 21:10:33        
hostport_B2          420.8KB            53               25555859824      7425464984       1530.4TB         182.0TB          0                8016             8606             7885             2016-09-30 21:10:33        
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

So what is I/O, read, and write resp time? I presume that's in milliseconds, and the lvl2 agent confirmed this (many times). That's 7.8s IO latency, 19.4s read latency, and 7.3s write latency?
The support guy didn't think these were "bad" numbers, and did say that they do, in fact, represent latency, and that they are in milliseconds. Of course, he has to be wrong about at least one of those points.

Does anyone know definitively what those values are, and what units they are? Is this my insane latency?

Also, in the controller web UI, if you go into performance > discs > history, it shows a graph with spikes of up to around 20 errors every time we experience this latency (i.e. every time we do a big write). This is the case on all 20 discs on each controller, and the spikes coincide perfectly with known large write workloads.
Again, we can only move 150MB/s peak anyway, as we've only got 1Gb Ethernet, and IOPS never goes beyond 100 per controller at its highest.
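For context on that ceiling, here is a rough back-of-the-envelope sketch in Python. It assumes the links are 1 gigabit per second Ethernet with two iSCSI paths per host as described above, and it ignores TCP/iSCSI protocol overhead, so real-world numbers will be somewhat lower. The function name is hypothetical.

```python
def link_capacity_mb_per_s(gigabits_per_s: float, links: int = 1) -> float:
    """Raw line rate in decimal MB/s for `links` links, ignoring protocol overhead."""
    bits_per_s = gigabits_per_s * 1_000_000_000 * links
    return bits_per_s / 8 / 1_000_000

single = link_capacity_mb_per_s(1)      # one 1 Gb/s link
dual = link_capacity_mb_per_s(1, 2)     # two links used together (assumed multipathing)
observed_peak = 150.0                   # MB/s, the peak read figure quoted in the post

print(f"one link: {single:.0f} MB/s, two links: {dual:.0f} MB/s")
print(f"observed peak is {observed_peak / dual:.0%} of the two-link line rate")
```

A single 1 Gb/s link tops out at 125 MB/s raw, so a 150 MB/s read peak suggests both paths are already in use, and the 20 MB/s write level where latency climbs is far below either figure.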
HP support couldn't tell me what "errors" means, nor what it's in reference to, nor if that's errors per second, or over N time, ..  yaknow.. I know as much as they do I guess?

blistovmhz
Occasional Visitor

Re: MSA1040 extreme (1 to 40s) latency

Addendum. I just found a manual for the MSA 2040. Figure it'd be close enough. Read Resp Time (and friends) are apparently in milliseconds.

Re: MSA1040 extreme (1 to 40s) latency

@blistovmhz 

I am not sure which document you referred to, but the I/O, read, and write response times in the output you have shown are in microseconds. You can use any converter to convert microseconds to milliseconds; I always use Google.
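As a quick sanity check of that conversion, here is a minimal Python sketch applied to the response-time columns of the `show host-port-statistics` output from the original post. The dictionary layout and the helper names `us_to_ms` and `convert_all` are illustrative assumptions, not part of any HPE tool; the numbers are copied from the post.

```python
# Values copied from the show host-port-statistics output in the original post.
# They are treated as microseconds, per the CLI guide referenced in this thread.
RESP_TIMES_US = {
    "hostport_A1": {"io": 7843, "read": 19477, "write": 7331},
    "hostport_B2": {"io": 8016, "read": 8606, "write": 7885},
}

def us_to_ms(microseconds: float) -> float:
    """Convert microseconds to milliseconds (1 ms = 1000 us)."""
    return microseconds / 1000.0

def convert_all(stats_us: dict) -> dict:
    """Return a copy of the per-port stats with every value converted to ms."""
    return {
        port: {name: us_to_ms(value) for name, value in values.items()}
        for port, values in stats_us.items()
    }

RESP_TIMES_MS = convert_all(RESP_TIMES_US)

if __name__ == "__main__":
    for port, v in RESP_TIMES_MS.items():
        print(f"{port}: I/O {v['io']:.1f} ms, read {v['read']:.1f} ms, write {v['write']:.1f} ms")
```

Read this way, hostport_A1 is at roughly 7.8 ms I/O, 19.5 ms read, and 7.3 ms write response time, rather than 7.8 s, 19.5 s, and 7.3 s.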

Please see the MSA command line guide (page 317) for details:

https://support.hpe.com/hpesc/public/docDisplay?docId=c04957376

 

Hope this helps!
Regards
Subhajit

I am an HPE employee

blistovmhz
Occasional Visitor

Re: MSA1040 extreme (1 to 40s) latency

Thanks sir! Yea, I found that document last night as well after writing the post, and indeed it does specify microseconds. I wonder if the other doc I'd found was out of date?
The lvl2 agent I spoke to also confirmed at least 10x that it was milliseconds, and I kept asking him if he was absolutely positive it was ms and not µs. He kept insisting it was ms, but also didn't seem to understand that 20,000ms latency was high (though it would coincide with the real latency we're seeing at the storage path).
Even so, we're still looking at 25ms latency when the array is practically dead quiet. In that snapshot, we've only got 1.7MB/s throughput and 20ms read latency, which seems awfully high. In my other environments, I typically set alarm triggers if any disc latency ever goes over 20-25ms, as I generally see no higher than 10ms on 10k SAS, and not until at least 200+ IOPS. Is 20+ms latency normal on these units with 10k SAS discs?

And either way, we've still got 1-40 second latency by the time we get to the VM hosts, and the only suggestion I've got from HP so far is to update all the firmware (which, again, I was told to do online for both controller and disc firmware, and I still don't know if that's possible/safe, as I keep getting different answers).

Do you have any recommendations or ideas as to what might be the cause here? We'd scheduled to have everyone out of the building Monday morning to decrease I/O load while we performed the recommended firmware updates, but now we have no idea if the SAN has to be completely offline to do the discs. Most of the docs I've found contradict what the support people have told us.

blistovmhz
Occasional Visitor

Re: MSA1040 extreme (1 to 40s) latency

Ah, one other question that I haven't managed to get an answer for.
If the controller does in fact have to be offline during a disc firmware update, can we not simply take a single controller offline, update its controller and disc firmware, fail the storage path back to the newly updated controller, and then perform the upgrade on the second controller and discs? I don't understand why this would have to be done fully lights out.

Re: MSA1040 extreme (1 to 40s) latency

@blistovmhz 

Regarding the controller firmware update: yes, it can be done online. However, we recommend doing any firmware update offline, after taking a proper data backup. This is just to avoid any unnecessary problems.

The drive firmware update, on the other hand, is always an offline activity. You can confirm this in any drive firmware installation instructions.

For example, go to the link below, search for the drive model you are using, and open the firmware download link. The Installation Instructions tab has the details.

https://h41111.www4.hpe.com/storage/msafirmware.html

As for performance, it's difficult to comment without checking the logs and data for the entire setup. My recommendation would be to check IOPS rather than latency at the drive level, and to check latency at the vdisk, VDG, volume, or host port level. You should always look at the impact rather than the raw values. That may help.

 

Hope this helps!
Regards
Subhajit

I am an HPE employee

Cali
Honored Contributor

Re: MSA1040 extreme (1 to 40s) latency

> If the controller does in fact have to be offline during a disc firmware update, can we not just simply take a single controller offline, update it's controller and disc firmware

This will not work, because both controllers have access to all disks.
The firmware in question is for your 20 SAS disks, not for any disk inside a controller.


======================
I'm not an HPE employee, so I can be wrong.