EVA / ESX Problem


I have a customer who has an EVA 4400. The EVA has 60 15k 146GB FC disks and 8 1TB FATA disks.
The FC disks hold about 30 LUNs with different datastores for ESX. Every LUN (datastore) holds about 6-8 VMs. Some LUNs are presented as raw devices to ESX.
The FATA disks are used for backup purposes. Some LUNs are used for VCB backups, some LUNs are used as VMDK stores for backup.

In the beginning we had some problems with the EVA: controller errors, I/O module errors, VDisk errors, and so on.

HP was onsite and replaced some hardware (controller, I/O modules, cables). After that, the newest XCS and CV were installed, because some failures were known in the initial XCS version.
For the last 4-5 weeks these problems had disappeared and the EVA was working well.

Until last week, when the EVA started to show some strange effects.
A VCB backup (or a copy job) to the FATA disks runs at the same time as a backup job to the VLS. After 10-15 minutes, the VMs on the EVA get slower and slower, until they freeze.
After stopping the backup/copy, everything works normally again.

At first I thought the problem could be the result of a performance bottleneck, so I did some performance analysis.

Max. IOPS during backup (1 h): 6000 IOPS; the rest of the time it's about 600 - 1500 IOPS.
This seems to be OK for an EVA configured like this, because during the (incremental) backup all datastores are delivering (read) data.

On the FATA disk group, latency reaches about 1000 ms. I know that FATA disks are slower, and that the 8-disk configuration is probably the absolute minimum.

Today, while I was onsite, the same effect happened. I observed that even the CV server, which does not have any LUN presented, had several SCSI resets on \SCSI\Raid0.

So I am asking the community for some help.

- How much latency is tolerable ?
- What happens with the EVA if latency is too high ?
- Can some problems on one DiskGroup (FATA) affect the whole array ?
- or are these problems not at all performance related, but some hardware errors ?
(EVA logs and switch logs do not show anything)

Any inputs, tips and hints are very welcome...
Víctor Cespón
Re: EVA / ESX Problem

- How much latency is tolerable ?
FC disks: 15 ms read, 5 ms write
FATA disks: 25 ms read, 5 ms write
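
To make these limits easy to apply to measured values, here is a minimal Python sketch. The dictionary layout and the function name are only illustrative (they are not part of EVAperf or any HP tool); the numbers are the rule-of-thumb limits quoted above.

```python
# Rule-of-thumb latency limits in milliseconds, as quoted above.
# The structure and names are illustrative, not from any HP tool.
LIMITS_MS = {
    "FC": {"read": 15.0, "write": 5.0},
    "FATA": {"read": 25.0, "write": 5.0},
}


def exceeds_limit(disk_type, operation, latency_ms):
    """Return True if a measured latency is above the rule-of-thumb limit."""
    return latency_ms > LIMITS_MS[disk_type][operation]


# The roughly 1000 ms reported for the FATA disk group is far above the limit.
print(exceeds_limit("FATA", "read", 1000.0))  # True
# A 5 ms read on FC disks is within the limit.
print(exceeds_limit("FC", "read", 5.0))  # False
```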

- What happens with the EVA if latency is too high ?

I/O operations get queued, and hosts see that reads and writes take too long.

- Can some problems on one DiskGroup (FATA) affect the whole array ?

Yes, this has been seen in many cases.

- or are these problems not at all performance related, but some hardware errors ?

We need an EVAperf capture for this, to see what latencies each disk group and each individual disk has.

You must consider that a disk group with 8 FATA disks has a very low performance limit.

8 FATA disks + LUNs in VRAID 5 + mostly writes = Bad idea

This usually leads to the very high latencies you're seeing.
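
A rough back-of-the-envelope sketch of why this combination is so limited. It uses the textbook RAID 5 write penalty of 4 back-end I/Os per small host write (read data, read parity, write data, write parity). The 80 IOPS per FATA disk and the 90% write share are assumptions for illustration only, not measured or vendor figures, and the model ignores controller cache and full-stripe writes.

```python
def max_host_iops(disks, iops_per_disk, write_fraction, write_penalty):
    """Estimate the host IOPS a disk group can sustain.

    Each host read costs one back-end I/O; each host write costs
    `write_penalty` back-end I/Os. Cache effects are ignored.
    """
    backend_iops = disks * iops_per_disk
    cost_per_host_io = (1.0 - write_fraction) + write_fraction * write_penalty
    return backend_iops / cost_per_host_io


# Assumed values: 8 FATA disks, 80 IOPS each, 90% writes, RAID 5 penalty of 4.
estimate = max_host_iops(disks=8, iops_per_disk=80, write_fraction=0.9, write_penalty=4)
print(round(estimate))  # 173
```

Under these assumptions the group saturates at well under 200 host IOPS, so a backup stream that writes faster than that can only queue up.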

Run this: EVAperf all -cont 5 -dur 3600 -csv -fo data.csv

Compress the resulting file and attach it here.

Re: EVA / ESX Problem

Hi, thanks for your answer.

Sorry for the late reply, I was out working the whole day.

I know that this FATA-DG is a problem.

But can problems (if any) with this DG really affect the other DG ? Is it possible to isolate such a situation ?

I would also note that, apart from the FW-related problems, the customer has worked like this for months without problems.
They only added some more VMs on the FC DG.

Attached you will find the related file. I had to reduce the amount of data, since the compressed file was bigger than 1 MB.

Thank you in advance for your help.
Víctor Cespón
Re: EVA / ESX Problem

Try sampling at a lower rate:

EVAperf all -cont 15 -dur 3600 -csv -fo data.csv

This takes a sample every 15 seconds, for example.
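
As a quick check of how much this shrinks the capture, here is a small Python sketch. It assumes that -dur is the capture duration in seconds (3600 would then be one hour), which the thread does not state explicitly; the function name is just for illustration.

```python
def sample_count(duration_s, interval_s):
    """Number of samples taken in a capture of the given length."""
    return duration_s // interval_s


print(sample_count(3600, 5))   # 720 samples at one every 5 seconds
print(sample_count(3600, 15))  # 240 samples at one every 15 seconds
```

So the 15-second capture holds about a third of the rows of the 5-second one.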

I can't see any performance problems in the data you sent.

The default disk group has read latencies below 5 ms and write latencies below 1 ms.

The FATA disk group has write latencies of 25 - 30 ms, typical for FATA disks.

You should run this data capture while the performance problems are occurring.

See attached file with charts

Re: EVA / ESX Problem

Hi (unfortunately I don't know your name),

First of all, thank you very much for your help. It's always good to have someone to discuss a problem with. I will certainly assign appropriate points at the end.

Yes, you're right, the system works fine during normal operation.

The problem arises when one of these is running:
- Backup of one specific drive e: on one specific server (after about 3-4 hours). Backup of drive d: has no failures. Both disks are mapped as raw disks.
- VCB backup of VMs to the FATA pool
- SecureCopy job from a physical server to a virtual server on the FATA pool

Because a disruption of the EVA affects a production server, which has to be 100% online 24 hours a day (for the moment), we are not allowed to run these tests and analyze the performance.

Management has given us the opportunity to run these tests on Saturday (only). We will do the analysis and I will post some results as soon as possible.

But one other question, as perhaps you have more experience.
In the last reports I posted, in the portstatus I see high numbers on the fabric ports (FP) and device ports (DP) for
- DeviceBadRxChar
- LinkFail
- LossofSynch
Is this negligible (they also do not show up in the logs made today)? Or are these symptoms of some problem at controller level ?

Attached are more EVAperf logs. You will see the backup job starting at 20:00. But again, the FATA disk pool is not involved.

As I understand it, again no performance issue.

Thank you a lot for your help.
Víctor Cespón
Re: EVA / ESX Problem

Hi, answering each point:

Running EVAperf does not impact production. At some customers we have set up data capture every minute for 24 hours. This lets you see how the traffic varies during the day.

The portstatus numbers are totals since the EVA was powered on; you have to check whether they increase over time.
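
One way to do that check is to note the counters from two captures taken some time apart and compare them. A minimal Python sketch follows; the counter names are the ones mentioned in the question, while the values and the function name are invented purely for illustration.

```python
# Counter names as mentioned in the question above.
COUNTERS = ("DeviceBadRxChar", "LinkFail", "LossofSynch")


def growing_counters(earlier, later):
    """Return {counter: increase} for counters that grew between two captures."""
    deltas = {}
    for name in COUNTERS:
        increase = later[name] - earlier[name]
        if increase > 0:
            deltas[name] = increase
    return deltas


# Invented example values for one port, captured at two different times.
earlier = {"DeviceBadRxChar": 1200, "LinkFail": 14, "LossofSynch": 30}
later = {"DeviceBadRxChar": 1200, "LinkFail": 14, "LossofSynch": 37}
print(growing_counters(earlier, later))  # {'LossofSynch': 7}
```

A large but constant total is old history; only a counter that keeps growing points to a current problem on that port.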

In this capture there are no performance problems either. The default disk group of 66 15K disks can easily handle the 3500 IOPS.
The FATA disk group is not being used at all, so no impact.
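
To put the 3500 IOPS figure in perspective, a rough per-disk calculation in Python. The 150 IOPS per 15K disk used for comparison is an assumed rule-of-thumb value, not a measured or vendor figure, and the calculation ignores cache and RAID overhead.

```python
def iops_per_disk(total_iops, disks):
    """Average front-end IOPS each disk would see if load were spread evenly."""
    return total_iops / disks


# Assumed rule-of-thumb ceiling for one 15K disk (illustrative only).
ASSUMED_15K_LIMIT = 150

load = iops_per_disk(3500, 66)
print(f"{load:.0f} IOPS per disk, {load / ASSUMED_15K_LIMIT:.0%} of the assumed limit")
```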

See attached screenshots
Re: EVA / ESX Problem

