StoreVirtual Storage

Peter J West
Frequent Advisor

LSI_SAS Errors. Related to P4000 SAN?

Hi,

 

We have a new virtual environment and SAN consisting of 3 x HP DL380 G7 servers running VMware vSphere 4 and 6 x HP P4500 G2 storage nodes.  The nodes are configured as a multi-site SAN with 3 nodes at each site.

 

So far I've been impressed with the performance, but more recently we've spotted a strange issue whereby the virtual servers (which are running Server 2008 R2) will freeze for a short period.  They then continue as if nothing happened, but during the freeze they're not contactable, and it's causing some strange issues with our applications.

 

Digging through the logs it looks like the issue relates to an error in the event log which says:

 

Reset to device, \Device\RaidPort0, was issued

Source: LSI_SAS

 

In this case I'm guessing RaidPort0 is the main operating system volume.  The current configuration for OS installation is that we present an 800GB volume from the SAN to all 3 of the ESX servers (we have to do this for vMotion to work) and then create the VMDK file for each OS within this volume.  So far we have around 13 servers all running from this volume, and my gut feeling is that perhaps this is overloading the volume with IOPS.


So, a few questions:

 

1. Is the way we're 'slicing' the disk the best way of doing it - or should we be using smaller volumes and spreading the OS volumes across them?  Is there an advantage to this?

 

2. Are there any published figures on what the maximum IOPS values are?  The performance graphs in the CMC are great, but they don't mean a great deal if you don't know what a 'bad' value is.

 

3. Does anyone have any more ideas on this issue and what we can do to resolve it?  I've checked across a number of the servers and they all seem to have similar errors despite being on different physical hosts.

 

I also found mention of the above issue and of a patch, but that related to SAN/iQ version 8 and all of our storage nodes are running the latest version of SAN/iQ.  I've also installed the appropriate Solution Pack on each of the VMs, which includes the MPIO component.

 

Grateful for any suggestions.

 

Pete

 

 

RonsDavis
Frequent Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

In VMware's documentation there is a little checklist you should go through when setting up iSCSI. 

http://www.vmware.com/pdf/vsphere4/r41/vsp_41_iscsi_san_cfg.pdf

In this version that's on page 99.

The item that concerns me, and may concern you, is:

Set Disk.UseDeviceReset to 0

By default this is set to 1; the other option they list is to set Disk.UseLunReset to 1.

If I'm reading correctly, you can reset either the device or the LUN. In the case of a P4000 hosting multiple LUNs, you don't want to send device resets, since that may affect multiple LUNs.
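
(On ESX 4.x both settings can be checked and changed from the service console with esxcfg-advcfg, or under Configuration > Advanced Settings in the vSphere Client. A minimal sketch, using the values from the iSCSI SAN configuration guide - verify against your own build:)

# show the current values
esxcfg-advcfg -g /Disk/UseDeviceReset
esxcfg-advcfg -g /Disk/UseLunReset

# send LUN resets rather than whole-device resets
esxcfg-advcfg -s 0 /Disk/UseDeviceReset
esxcfg-advcfg -s 1 /Disk/UseLunReset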

 

Peter J West
Frequent Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

So basically change UseDeviceReset to 0 for each of the ESX hosts?  UseLunReset should be left at its default of 1.

 

I'll give that a try.

 

Thanks

 

Pete

 

lando_uk
Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

Use smaller datastores and balance them across the cluster. If you just have one large VMFS then you're limiting yourself to 1Gb/s for all of those VMs, whereas if you have 3 nodes at each site you can create 3 VMFS volumes so that each node holds a gateway, potentially giving your ESXi hosts 3Gb/s to play with.

 

You might still see pauses and slowdowns when doing Storage vMotions, but that's the nature of the beast.

Peter J West
Frequent Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

Thanks Lando,

 

In terms of storage throughput though, I'd hope we're getting more than 1Gb/s.  We have 3 NICs in a round-robin configuration, which means our theoretical limit for storage traffic is 3Gb/s for each of the ESX servers?

 

At the moment the VMFS volume for operating system files is around 800GB, but I've checked back and noticed we've got these LSI_SAS errors occurring on servers going back as far as June 2011.  Back then we only had one or two servers running in the virtual environment.

 

Strangely we're not seeing massive performance issues either at VM level or within the SAN, but we do have at least 2 hosts which seem to 'lock-up' occasionally and this is giving me cause for concern.

 

Thanks for the comments so far - they're really appreciated.

 

Kind regards

 

Pete

 

lando_uk
Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

Your LeftHand nodes have only 2 x 1Gb/s NICs.   If your VMs are all on a single VMFS then the limit for that datastore is 1Gb/s, even if each of your hosts is using 3 x 1Gb/s with round robin.  This is because your ESXi host can only read and write to the P4500 node that your VMFS is located on (its gateway).  ESXi doesn't have a DSM driver like Windows does, so it can't read from all the nodes in the cluster at the same time.

 

Even though each node has 2 NICs, you can still only really utilise one when using Network RAID 10, as the node uses the other NIC for fetching the other chunks of data from the other nodes.

 

A simple test to see this is to run an Iometer sequential-read test from one VM in datastore-a.  If only one VM is running in datastore-a you'll see about 100MB/s of throughput.  Fire up another VM in the same datastore and run the same test at the same time and you'll see both VMs run at about 50MB/s; add a third VM and it drops to about 33MB/s, and so on....

 

So when you only have 1 datastore, a single VM within that datastore can potentially impact all the other VMs.

 

So if you have a 3-node cluster, create 3 datastores and divide your VMs up; separate out heavy-IO VMs like SQL servers and don't put them on the same datastore.

 

With multiple datastores on the same cluster there are other issues to consider (VAAI issues), but that is a whole other story!

 

All this might not fix your problem, but it's good practice and will improve performance.

 

Bottom line is, to get decent performance from LeftHand and VMware you have to use 10Gb/s NICs.

 

 

Peter J West
Frequent Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

Hi Lando,

 

Thanks for explaining that to me - it makes a lot of sense.

 

Currently the LUN defined for operating system installs is around 800GB, and we have 14 virtual machines all using this single volume for their operating systems.

 

This clearly isn't good and your explanation gives me a clear understanding of what the problem is.

 

So, my plan is to start by creating 4 smaller volumes, each of 240GB.  We'll then have no more than 4 VMs using each volume for their operating systems, but at the same time we'll consider the storage footprint of each server.
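
(For anyone scripting this, volumes can also be created from the CLIQ command line that ships with the CMC rather than clicking through the console. This is only a rough sketch - the volume/cluster/server names, VIP and credentials below are made up, and the exact parameter names should be checked against the CLIQ reference for your SAN/iQ release:)

# create the 240GB OS volumes (replication=2 is Network RAID-10); repeat for the others
cliq createVolume volumeName=OS-VMFS-1 clusterName=SITE-CLUSTER size=240GB replication=2 login=10.0.0.10 userName=admin passWord=secret
cliq createVolume volumeName=OS-VMFS-2 clusterName=SITE-CLUSTER size=240GB replication=2 login=10.0.0.10 userName=admin passWord=secret

# present each new volume to the server entries defined for the ESX hosts
cliq assignVolumeToServer volumeName=OS-VMFS-1 serverName=ESX01 login=10.0.0.10 userName=admin passWord=secret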

 

We did look at the 10Gb/s versions of the LeftHand solution, but the cost of 10Gb/s switching gear was just prohibitively expensive for us at the time the project was going ahead.

 

My understanding is that the current configuration should be fine, but we just need to slice the disk differently and not load all the OS installs onto a single volume.

 

Pete

 

Bart_Heungens
Honored Contributor

Re: LSI_SAS Errors. Related to P4000 SAN?

Hi,

 

Can you say a little more about the issues you've seen with VAAI? I have done tests with it and they were really satisfying... never had issues with it... deploying templates goes much faster than without VAAI enabled...

 

I can agree completely on having more than one VMFS volume... If you enable load balancing correctly you can go up to 2Gb/s, since it will spread the load across the 2 NICs... provided you have multiple volumes and VMs of course...

 

Kr,

Bart

--------------------------------------------------------------------------------
If my post was useful, click on my KUDOS! "White Star" !
Peter J West
Frequent Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

Stupid question time maybe?  What's VAAI?

 

Also, with regard to SAN throughput: within VMware we have a vSwitch defined just for SAN traffic and I've pooled 3 vmnics to that switch.  The failover order on each is overridden and an active adapter is manually specified.

 

My understanding was that this configuration would give us a combined throughput of 3Gb/s and also fault tolerance: the failure of any one of the 3 NIC paths shouldn't cause the VMware host to lose network connectivity.
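
(For reference, on vSphere 4.1 you can confirm from the CLI whether the VMkernel ports are actually bound to the software iSCSI adapter and whether the path policy on the P4000 volumes is round robin - the vmhba/vmk/naa names below are just examples and will differ on your hosts:)

# list the VMkernel NICs bound to the software iSCSI adapter
esxcli swiscsi nic list -d vmhba33

# bind an extra VMkernel port if one is missing
esxcli swiscsi nic add -n vmk2 -d vmhba33

# check the path selection policy on a volume and switch it to round robin if needed
esxcli nmp device list
esxcli nmp device setpolicy --device naa.6000eb3xxxxxxxxxxxxxxxxxxxxxxxxx --psp VMW_PSP_RR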

 

I'm now wondering if my VMware network configuration is maybe incorrect.


Grateful for any pointers.

 

Cheers

 

Pete

 

Edit: I should also perhaps mention that I've noticed the SQL Servers in our VM environment are complaining about latency when accessing TempDB; we're seeing event ID 833 reported by SQL on a fairly regular basis.  Unfortunately one of these is a production system, which means I can't take the server down just now, but the plan is to move TempDB to another volume when a suitable window becomes available.  Currently TempDB is on C:, which still suggests we have some latency on the OS volumes.  I've not yet finished distributing the operating systems over the 4 new volumes, so maybe this issue will die down once I have.
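
(For when that window arrives, relocating TempDB is just a couple of ALTER DATABASE statements followed by a restart of the SQL Server service - a minimal sketch, assuming the new volume is mounted as T: and the default logical file names; the drive letter and paths are only examples:)

-- point the tempdb files at the new volume; takes effect at the next service restart
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'T:\TempDB\tempdb.mdf');
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'T:\TempDB\templog.ldf');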

lando_uk
Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

The issue we have encountered with VAAI is kernel latency within ESXi when you're doing a Storage vMotion between volumes on the same cluster.

 

For example: a 2-node cluster with 2 volumes that are load-balanced across the cluster.  Do a Storage vMotion from datastore1 to datastore2 - everything looks great from the outside and it whizzes along. But when you look closer, inside esxtop, you'll see bad kernel latency on the LUN that you're copying to. That bad latency affects all the VMs that are also on that volume for as long as the Storage vMotion is running; if you also take a snapshot of one of the other VMs, you'll probably see it pause for maybe 10 seconds and lose its heartbeat.
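
(To see this for yourself, the relevant counters are in esxtop's disk views - roughly, on ESX/ESXi 4.x:)

# run interactively on the host (or resxtop via the vMA)
esxtop
#   press 'u' for the disk-device view, or 'v' for the per-VM disk view
#   watch DAVG/cmd (array latency) and KAVG/cmd (kernel latency)
#   sustained KAVG/cmd of more than a few ms during the svMotion is the symptom described above

# or capture a batch sample for later analysis
esxtop -b -d 5 -n 60 > latency.csv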

 

The cause of this latency is that VAAI operations will totally saturate one NIC on each node - if you look at the performance logs for each node you'll see the NIC hit 99% utilisation. It seems the ALB that HP uses doesn't work very well here, as it won't use all 4 NICs for VAAI.  The problem gets worse, with higher latency, when using more nodes in the cluster.

 

In all their latest documentation for P4000/ESXi 5.0 they sort of mention that 10Gb/s is recommended to reduce 'latency issues', which to me is a kind of admission that 1Gb doesn't cut it in 2011 for virtualised workloads...

 

We are in the process of evaluating the 10Gb option, as we couldn't get 1Gb to offer decent performance. Initial tests of 10Gb seem to have fixed all our issues (a snapshot of an 8GB VM took 10+ minutes on 1Gb, reduced to under 2 minutes with 10Gb).