HPE EVA Storage
Kurt Gunter
Advisor

One server cripples an EVA 8x00!

Our company owns a large number of EVA SANs, most of them being 8000s, 8100s and 8400s.

For a while now we've been receiving end-user complaints from one of our departments. This department has an EVA 8100 fully populated with 168 300GB 15K FC drives. They run a mix of VMs (ESX 4) and physical servers of various types. Amongst them are a number of Windows SQL clusters and UNIX MySQL servers, all using LUNs ranging from 50GB to 2TB (all vRAID5).

Their complaint was that all of their servers - UNIX, VM and Windows - periodically show alerts for poor performance from the EVA. We've spent 6 months investigating this, doing everything from patching the Brocade firmware to balancing paths and implementing best practices. We covered the EVA as well - HP techs were brought in to examine the SAN for any possible hardware or configuration problem. They found and replaced a noisy internal loop, but the problems continued.

As we ruled out more and more things, we edged closer to the conclusion that the EVA itself was the problem. But it didn't make sense. Our setup is rated for at least 10,000-15,000 IOPS, and these guys were averaging about 3,000, spiking to 6,000. We have an internal program that uses EVAPerf to collect stats and dump them into a database, each sample being an average over 5 minutes. None of the resulting graphs - IOPS, controller load, disk stats, anything - showed anything amiss.
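To put numbers on why those graphs stayed flat: a short burst simply vanishes into a 5-minute mean. A quick sketch with hypothetical figures:

# How a 5-minute average swallows a short burst (hypothetical numbers).
SAMPLE_WINDOW_S = 300      # one stored sample = a 5-minute average
baseline_iops = 2800       # typical load on this EVA
burst_iops = 6000          # brief spike
burst_s = 20               # burst duration in seconds

avg = (baseline_iops * (SAMPLE_WINDOW_S - burst_s)
       + burst_iops * burst_s) / SAMPLE_WINDOW_S
print(f"stored sample: {avg:.0f} IOPS")   # ~3013 -- the burst is invisible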

(continuedтАж)
Kurt Gunter
Advisor

Re: One server cripples an EVA 8x00!


The first hint of the problem came when I was investigating another issue. A few drives had recently started to fail in our EVAs - specifically, the latest batch of Seagate 300GB and 450GB FC disks, though I'm not sure that matters. In any case, these drives (about 4 of them across 3 different 8100s and 8400s) were experiencing "check condition" errors which turned out to be physical issues with the drives. The drives were failing, but for some reason had not yet tripped the S.M.A.R.T. thresholds. Apparently the EVA uses S.M.A.R.T. status as an indication of drive health. So these disks were having to retry numerous times on writes before they succeeded.

The interesting thing is that these single-drive issues were impacting the ENTIRE SAN. Hosts connected to the SAN would experience high latencies on writes, which would be interpreted by the host as "access denied" when Robocopying files and so on. HP investigated and confirmed on two separate occasions that the disk was the cause. I could see it too: when I launched EVAPerf I saw large queues (300-500) on the physical drive that was failing. We replaced the faulty disks and voila, the problem disappeared. I demanded HP investigate, and they have since come back saying that Seagate seems to be setting their S.M.A.R.T. failure thresholds too high, so disks fail in practice before the drive firmware marks them as failed. A cheap trick on Seagate's part to reduce their RMA volume... but anyway... HP says that in the next EVA firmware version they will override the S.M.A.R.T. thresholds and let the EVA make its own decision as to whether a disk is bad or not.
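Spotting this is easy enough to script once you know to look. A minimal sketch - the report file name and column positions are assumptions, so adjust them to whatever your EVAPerf version emits for physical disks:

# Sketch: flag physical disks with abnormally deep queues in a saved
# EVAPerf physical-disk report. File name and column layout are assumed.

QUEUE_THRESHOLD = 100   # a queue this deep on one spindle is suspect

def flag_hot_disks(report_path):
    with open(report_path) as report:
        for line in report:
            fields = line.split()
            # Assumed layout: numeric disk ID first, queue depth second.
            if len(fields) < 2 or not fields[0].isdigit():
                continue   # skip headers, separators, blank lines
            if fields[1].isdigit() and int(fields[1]) > QUEUE_THRESHOLD:
                print(f"disk {fields[0]}: queue depth {fields[1]} -> investigate")

flag_hot_disks("evaperf_disk_snapshot.txt")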

The disturbing part is how this affected the entire SAN. Everyone knows that a basic principle of any SAN is that a single disk failure should not impact data or array performance. Here we have a single disk failure causing catastrophic consequences for EVERY SINGLE LUN which holds data on that failing disk - that is to say, all of them. And where is the EVA cache in all of this? It was not full, and surely it should be caching writes and then feeding them to the disk, shielding the end-user from the problem. This was clearly not happening as it should.

So, back to the problem at hand... you will see how this fits in later.

(continued...)
Kurt Gunter
Advisor

Re: One server cripples an EVA 8x00!

Running EVAPerf, I noticed that most of the time things were OK. But occasionally I would see some interesting stuff on the host ports. The queue length on all 8 ports would suddenly bound upwards from 0 or 1 to 10, 20, 30, even 40... and at the same time, the IOPS were not excessive and the HBAs were not saturated from a bandwidth perspective. Here's an example:

Name  Read   Read   Read     Write  Write  Write    Av.    Port                 Ctlr  Node
      Req/s  MB/s   Latency  Req/s  MB/s   Latency  Queue  WWN
                    (ms)                   (ms)     Depth
----  -----  -----  -------  -----  -----  -------  -----  -------------------  ----  ---------------
FP1      61   3.37     16.4   1921  12.38      4.4     17  5000-1FE1-5011-7488  8022  MDC-SAN-EVA8K03
FP2      47   2.80     11.8   1115  11.17      4.7     10  5000-1FE1-5011-7489  8022  MDC-SAN-EVA8K03
FP3      66   3.64     12.1   1475  13.33      4.8     13  5000-1FE1-5011-748A  8022  MDC-SAN-EVA8K03
FP4      71   3.67     12.3   1143  13.03      4.7     11  5000-1FE1-5011-748B  8022  MDC-SAN-EVA8K03
FP1     532  10.93     16.3   1342  14.76      6.2     21  5000-1FE1-5011-748C  W05Y  MDC-SAN-EVA8K03
FP2     380  13.53     15.9   1315  14.50      6.2     12  5000-1FE1-5011-748D  W05Y  MDC-SAN-EVA8K03
FP3     224  14.16     19.7   1179  19.87      7.5     16  5000-1FE1-5011-748E  W05Y  MDC-SAN-EVA8K03
FP4      65   2.99     14.8   1146  14.36      7.0     16  5000-1FE1-5011-748F  W05Y  MDC-SAN-EVA8K03


This wasn't showing in my graphs because it was "bursty" - remember, my graphs took the average over 5 minutes, which irons out short spikes. The only way to capture it was to run EVAPerf by hand, over and over, and analyze the output to see if I'd caught it while it was spiking. Before long, I had trapped a few examples.
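In case anyone wants to set the same trap, the manual process can be automated along these lines. This is only a sketch - the evaperf invocation and the column layout are placeholders from memory, so check them against your version's actual output:

# Sketch: poll EVAPerf at short intervals and keep any snapshot where a
# host-port queue spikes. The evaperf arguments are placeholders.
import subprocess
import time

POLL_INTERVAL_S = 5
QUEUE_SPIKE = 10        # flag anything at or above this queue depth

while True:
    out = subprocess.run(["evaperf", "hps"],   # placeholder host-port-stats command
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        # Assumed layout: 'FPn' rows with queue depth in the 8th column.
        if fields and fields[0].startswith("FP") and len(fields) > 7 \
                and fields[7].isdigit() and int(fields[7]) >= QUEUE_SPIKE:
            stamp = time.strftime("%Y%m%d-%H%M%S")
            with open(f"spike_{stamp}.txt", "w") as snap:
                snap.write(out)     # keep the whole snapshot for later analysis
            break
    time.sleep(POLL_INTERVAL_S)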

I took this EVAPerf output and examined it. I looked at the controllers and cache. Both were working hard, but not too hard - one controller was at 77% and the other at 30%. I kept looking further down, and further, until I found the cause at the physical disk level.

All the disks in the array had very short queues - EXCEPT a handful where the queues were hideously large. At one point I found a queue length of over 1,000 on a single disk. What could be causing this?

(continued...)
Kurt Gunter
Advisor

Re: One server cripples an EVA 8x00!

The answer was in the drive stats. The drive was receiving 200+ write requests per second, all targeting less than 1MB of data in total. Essentially, something was issuing many small nonsequential write commands against the same small segment of data. And for some reason these writes were all hitting a handful of disks. Why?

The logical answer: data striping. Data is striped across disks; that's how RAID0 works (and the EVA is at base a RAID0 array). It so happens that the stripe size in the EVA is... 8MB! With a stripe that size, you can easily run into issues. A small amount of data... say 60KB... will NOT be spread across multiple disks; it will sit on just ONE physical disk. Writing to that one file therefore sends commands to the ONE disk that holds the data (as well as the parity), not to multiple disks.
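To make that concrete, here's a toy version of the mapping. This is a naive round-robin stripe - the EVA's real layout (RSSes, leveling) is more involved - so treat it purely as an illustration of scale:

# Toy model: why a small hot region lands on ONE spindle with 8MB stripes.
# Naive round-robin striping only; the EVA's real layout is more complex.

STRIPE_SIZE = 8 * 1024 * 1024    # 8MB stripe, as discussed above
NUM_DISKS = 168                  # disks in this disk group

def disk_for_offset(offset):
    """Which disk (naive round-robin) holds the byte at this LUN offset?"""
    return (offset // STRIPE_SIZE) % NUM_DISKS

hot_start = 512 * 1024 * 1024    # hypothetical 60KB hot spot at the 512MB mark
hot_len = 60 * 1024
print(disk_for_offset(hot_start), disk_for_offset(hot_start + hot_len - 1))
# -> 64 64: the entire hot region, and so ALL of those writes, hit one disk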

So you have situations like this happening:

ID   Drive  Drive    Read   Read  Read     Write  Write  Write    Enc  Bay  Grp      RSS  RSS    Ctlr  Disk                 Node
     Queue  Latency  Req/s  MB/s  Latency  Req/s  MB/s   Latency            ID       ID   Index        WWN
     Depth  (ms)            (ms)                  (ms)
---  -----  -------  -----  ----  -------  -----  -----  -------  ---  ---  -------  ---  -----  ----  -------------------  ---------------
 91    227        -     48  0.48    386.0    207   0.53    576.7   11   12  Default   20      1  8022  2004-000C-CA99-ED44  MDC-SAN-EVA8K03
140    275        -     54  0.51    594.0    216   0.58    813.0    6   12  Default   20      6  8022  2008-000C-CA4A-D0AC  MDC-SAN-EVA8K03
141    220        -     51  0.46    405.8    208   0.56    639.1   13    7  Default   20      7  8022  2008-000C-CA47-B3F0  MDC-SAN-EVA8K03


In this case, someone has an SQL server which caches a whole bunch of commands and then "bursts" them to the database in one shot - many small nonsequential writes (just a few bytes each), all located in the same small data area. These writes hit the EVA and are promptly delivered to the disk, which chokes as it tries to commit 200 nonsequential writes per second. Not only does this cause write latency (the poor drives only manage 0.5-0.7 MB/s), but anything else trying to access data stored on that drive has to sit in the drive's queue and gets slowed down as well.
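The arithmetic behind the choke is straightforward queueing. A rough sketch, using an assumed per-I/O service time for small random writes on a 15K FC spindle:

# Rough queueing arithmetic for the burst (service time is an assumption).
service_time_ms = 4.0                      # assumed per-I/O time, small random writes
spindle_ceiling = 1000 / service_time_ms   # ~250 IOPS for one 15K FC drive

offered_iops = 207 + 48                    # write + read req/s from the trace above
utilisation = offered_iops / spindle_ceiling
print(f"ceiling ~{spindle_ceiling:.0f} IOPS, offered {offered_iops}, "
      f"utilisation {utilisation:.0%}")
# Once utilisation passes 100%, the queue grows for as long as the burst
# lasts -- which is exactly the 200+ deep queues in the table above.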
Kurt Gunter
Advisor

Re: One server cripples an EVA 8x00!


(continued...)

Result: one SQL server can critically compromise an ENTIRE EVA with a simple burst of small write operations.

But wait, you say, the EVA has a write cache! Surely any writes the drive can't handle should be held in cache? After all, we're talking about bursts which last a few seconds and contain less than a hundred MB of data...

Well, I would think so too, except the EVA doesn't seem to be doing it. My evidence clearly shows the link between the disk choking and the queues backing all the way up through the EVA and out to the external HBAs.

I would still doubt the evidence of my own eyes... surely HP can't have a problem THAT big... except that, as I explained above, we have confirmed cases where a single failing drive kills the EVA's performance as well. I would expect the EVA to cache the writes there too... but it does not.

As I looked further, I saw the same kind of behaviour on ALL our other EVAs when placed under the same conditions. This is NOT limited to one array.

So...

I suspect there is a huge hole in the EVA's caching logic. This is a severe issue, as it basically means one server, or one failing disk, can easily cripple an EVA. With a billion-dollar business relying on arrays that can fall over because of one server, and many millions of dollars spent on our EVA infrastructure, I am angry beyond words to see this kind of behaviour.

If anyone wants to see the logs I have collected, or to ask me more questions about this, I am more than happy to oblige. I would love to be proved wrong, since that would mean HP keeps its multi-million dollar contract with our company and my bosses don't start making calls to Legal. But we already have several HP-confirmed cases of a single disk failure crippling an EVA, and this issue follows the same pattern.

Any comments?
Sheldon Smith
HPE Pro

Re: One server cripples an EVA 8x00!

The EVAs *must* be under a support contract. This is what HP Support is for! Call Support, get a case opened. Tell them you have critical production issues.

Note: While I am an HPE employee, all of my comments (whether noted or not) are my own and are not any official representation of the company.


Kurt Gunter
Advisor

Re: One server cripples an EVA 8x00!

Absolutely - HP is working on the case now, and I hope to have a response from them this week.

But I'm also interested in hearing from people who use EVAs in production. Has anyone seen this kind of behaviour before? Any gems on how the EVA cache logic works? Any tricks or tips?

I have a nasty feeling that HP is just going to tell me to split the LUN out onto a new disk group, which is not a useful response since we're already full (90% utilised), so there are no disks left to do that. We'll see.
erics_1
Honored Contributor

Re: One server cripples an EVA 8x00!

Kurt,

A lot of very interesting information here. I would very much like to hear what becomes of this. We have had, and continue to have, intermittent issues that at least appear similar on the surface to what you've experienced. Have you heard anything back from HP?

Thanks,
Eric
ColinF
Advisor

Re: One server cripples an EVA 8x00!

Do you see the same results if you run the same SQL transactions against a vRAID1 LUN?
Kurt Gunter
Advisor

Re: One server cripples an EVA 8x00!

I actually haven't tested that yet.

I'd expect a drop in latency, because we'd have half the I/Os going to disk compared to vRAID5. But wouldn't that be addressing a symptom and not the cause?

In a traditional RAID volume with a dedicated parity disk, I could see one disk becoming overloaded as it tries to store parity for all the rest. But as I understand it, an EVA is a collection of RAID0 disks with the other RAID levels "striped" over the top. That would mean parity data is also striped across the set, so no parity operation should place an extra load on any one disk. Therefore parity should not be the problem in an EVA, and we're back to thinking of writes in general as the issue - which takes us back to the situation where a move to vRAID1 would mitigate the symptoms but not the cause. Correct me if I'm wrong...
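For what it's worth, the back-end arithmetic here is standard RAID bookkeeping, not anything EVA-specific - a quick sketch using the burst rate from the trace above:

# Standard RAID small-write penalty arithmetic (not EVA-specific).
# RAID5 small write: read data + read parity + write data + write parity = 4 I/Os.
# RAID1 small write: write both mirror copies = 2 I/Os.

host_writes_per_s = 200   # roughly the burst rate seen on the hot disk

raid5_backend = host_writes_per_s * 4
raid1_backend = host_writes_per_s * 2
print(f"vRAID5: {raid5_backend} back-end I/Os/s, vRAID1: {raid1_backend}")
# Halving the back-end load would ease the symptom, but the hot region
# still maps to the same few spindles -- the cause remains.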

Taking the array as a whole, as I mentioned, we're not even close to the theoretical capacity of anything other than the disks - and most of the disks are just fine. We're just seeing several standouts in the entire disk set - and which ones stand out seems to change all the time.

Hmmm...