
EVA - asyncio

 
SOLVED
Stijn V
Regular Advisor

EVA - asyncio

Hello,

We are running several Linux ESX servers; their (VMFS) storage lives on LUNs on our EVA3000/6000 SAN.

We are receiving the following error on those Linux ESX servers in their system log files:
SCSI: 3753: AsyncIO timeout (5000); aborting cmd w/ sn 311286, handle 1fb16/0x720eb08

VMware ESX server support says there are performance problems on our EVA systems: an I/O timeout of 5 seconds!

In my opinion this is not possible; we have a lot of applications running on our EVAs. If we really had 5-second I/O timeouts on a regular basis, I would expect a lot of issues on other servers (Unix/Windows as well?).

How can I verify this on our EVA systems?
7 REPLIES
Jonathan Harris_3
Trusted Contributor

Re: EVA - asyncio

Could be lots of things, but just a couple of thoughts.

The first place I'd look would be the throughput on your ESX fibre ports. Are you running lots of services on ESX? Are you driving high volumes through the port(s)? Are the ports contended on the switch?

Secondly, how have you distributed the LUNs across the EVA 3000 and 6000? Bear in mind that the 6000 will give better performance and that it supports active-active, whereas the 3000 only supports active-passive. If you're splitting related LUNs across these arrays, it could give you issues.
Stijn V
Regular Advisor

Re: EVA - asyncio

We have several LUNs (VMFS) with a lot of VMs (virtual machines)... those VMs probably generate a lot of I/O.

One LUN is 750 GB and hosts about 20 VMs.

How can I easily monitor the fibre throughput?

We have a mix of EVA3000/6000 LUNs presented to those ESX servers (but that shouldn't cause problems?).
Uwe Zessin
Honored Contributor
Solution

Re: EVA - asyncio

You can check several layers with EVAperf (it is usually installed on the management server where Command View EVA runs) or take a look at the Fibre Channel switch counters (on Brocade ones it is "portPerfShow").

It is also possible to check within VMware ESX server using "esxtop" - see attachment.
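For reference, here is a rough sketch of where those counters live, plus a tiny filter for spotting spikes in a saved portperfshow capture. The switch and array commands only run on that hardware, of course, and the log numbers below are made up purely for illustration:

```shell
# On the Brocade switch (not runnable here): sample port throughput
# every 5 seconds with:  portperfshow 5
# On the ESX service console: run esxtop and watch the disk screen's
# DAVG (device latency) and KAVG (kernel latency) columns.
# On the Command View EVA management server: run evaperf for the
# array-side counters (see its built-in help for the per-object views).

# Given a saved portperfshow capture, flag samples whose switch-wide
# total (last column) exceeds a threshold, to correlate with the
# AsyncIO timeouts. Illustrative numbers, not real data:
cat > /tmp/ppf.log <<'EOF'
 12  8  30  25  75
 40 35  60  55 190
  5  3  10   8  26
EOF
awk '{ if ($NF > 100) print "spike:", $NF, "MB/s" }' /tmp/ppf.log
```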
Stijn V
Regular Advisor

Re: EVA - asyncio

Thanks.

It seems that the load is OK (checked via the esxtop and portperfshow commands).

There are some spikes of 100M total (portperfshow), but everything (HBAs and switches) runs at 4 Gbps.

The only thing I see is that the managing controller for all ESX LUNs is the same, i.e. controller A (I guess the default one).

Would it improve performance if I load-balanced them across controllers A and B? If so, can this be done online (and how)?
Uwe Zessin
Honored Contributor

Re: EVA - asyncio

If you balance your virtual disks across both controllers, you at least gain the CPU and cache capacity of the second controller.

You can easily change the preferred path setting within VMware ESX server connected to an EVA4000/6000/8000 as it is an active/active array and you can do I/Os through the non-owning controller. If the firmware is current, the EVA can detect this situation and automatically move ownership of the virtual disk to the other controller.

On the EVA3000/5000 it is not that easy: the failover policy should be set to MRU (most recently used), because I/O can only be done through the owning controller.

I have never tried it myself, but maybe you can force a failover by setting a fixed path on one of the servers, letting the others follow the failover, and then setting that server back to MRU.
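A quick way to verify the current path layout from the ESX side is the service console's path listing (the exact output format varies by ESX version; the path policy itself can also be changed per LUN in the VI client's Manage Paths dialog):

```shell
# List every SAN path ESX knows about, which one is currently active,
# and the per-LUN path policy (fixed/mru). Read-only, safe to run;
# only meaningful on an ESX service console, not reproducible here.
esxcfg-mpath -l
```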
Jonathan Harris_3
Trusted Contributor

Re: EVA - asyncio

Performance would be improved by balancing the load over both controllers, but it's not the source of your SCSI timeouts.

Have a look at this thread: http://www.wmware.com/community/thread.jspa?messageID=653365

Stijn V
Regular Advisor

Re: EVA - asyncio

Thanks, but according to VMware support the AsyncIO timeout means that the vmkernel asks the VM to retry the I/O!

This means that the I/O queue gets full... and the queue would not fill up unless there is a slowdown in I/O response.

And if load balancing improves performance, I would think it could solve our issue (making the queue depth larger is indeed a workaround -> but why is the queue getting full?).
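For completeness, the queue-depth workaround support mentioned is a driver module option set from the ESX 3.x service console. The module and option names below assume a QLogic HBA and differ for Emulex or other driver versions, so treat this as a sketch, not a recipe; and as noted above, a deeper queue only masks slow array response, it does not fix it:

```shell
# Assumption: QLogic HBA driver on ESX 3.x. Raises the per-LUN queue
# depth from the default to 64. The module name (qla2300_707) and the
# option (ql2xmaxqdepth) are driver-version specific -- verify them
# against your installed driver before applying.
esxcfg-module -s ql2xmaxqdepth=64 qla2300_707
# Takes effect after a reboot.
```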