
LSI_SAS Errors. Related to P4000 SAN?

SOLVED
Peter J West
Frequent Advisor

LSI_SAS Errors. Related to P4000 SAN?

Hi,

 

We have a new virtual environment and SAN which consists of 3 x HP DL380 G7 servers running VMware vSphere 4 and 6 x HP P4500 G2 storage nodes.  The nodes are configured as a multi-site SAN with 3 nodes at each site.

 

So far I've been impressed with the performance, but more recently we've spotted a strange issue whereby the virtual servers (which are running Server 2008 R2) freeze for a short period.  They then continue as if nothing happened, but while frozen they're not contactable, and it's causing some strange issues with our applications.

 

Digging through the logs it looks like the issue relates to an error in the event log which says:

 

Reset to device, \Device\RaidPort0, was issued

Source: LSI_SAS

 

In this case I'm guessing RaidPort0 is the main operating system volume.  The current configuration for OS installations is that we advertise an 800GB volume from the SAN to all 3 of the ESX servers (we have to do this for vMotion to work) and then we create the VMDK file for each OS within this volume.  So far we have around 13 servers all running from this volume, and my gut feeling is that perhaps this is overloading the volume with IOPS.


So, a few questions:

 

1. Is the way we're 'slicing' the disk the best way of doing it - or should we be using smaller volumes and spreading the OS volumes across them?  Is there an advantage to this?

 

2. Are there any published statistics on what the maximum IOPS values are?  The performance graphs in the CMC are great, but they don't mean a great deal if you don't know what a 'bad' value is.

 

3. Does anyone have any further ideas on this issue and what we can do to resolve it?  I've checked a number of the servers and they all seem to have similar errors despite being on different physical hosts.

 

I also found a mention of the above issue along with a patch, but that related to SAN/iQ version 8 and all of our storage nodes are running the latest version of SAN/iQ.  I've also installed the appropriate Solution Pack, which includes the MPIO component, on each of the VMs.

 

Grateful for any suggestions.

 

Pete

 

 

15 REPLIES
RonsDavis
Frequent Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

In VMware's documentation there is a little checklist you should go through when setting up iSCSI. 

http://www.vmware.com/pdf/vsphere4/r41/vsp_41_iscsi_san_cfg.pdf

In this version that's on page 99.

The item that concerns me, and may concern you, is:

Set Disk.UseDeviceReset to 0

By default this is set to 1; the other option they list is to set Disk.UseLunReset to 1.

If I'm reading it correctly, you can reset either the device or the LUN. In the case of a P4000 hosting multiple LUNs, you don't want to send device resets, since that may affect multiple LUNs.
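
If you'd rather script the change than click through each host, something like the following pyVmomi sketch should do it. This is illustrative only: the hostname and credentials are placeholders, and the same settings can be changed per host in the vSphere Client under Advanced Settings.

```python
# Illustrative sketch: prefer LUN resets over whole-device resets on an ESX(i) host.
# Hostname and credentials are placeholders; adjust for your environment.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only: skips certificate validation
si = SmartConnect(host="esx01.example.com", user="root", pwd="password", sslContext=ctx)
try:
    host = si.RetrieveContent().searchIndex.FindByDnsName(
        dnsName="esx01.example.com", vmSearch=False)
    opt_mgr = host.configManager.advancedOption

    # Print the current values first.
    for key in ("Disk.UseDeviceReset", "Disk.UseLunReset"):
        for opt in opt_mgr.QueryOptions(name=key):
            print(opt.key, "=", opt.value)

    # Apply the recommended values (both options are integer-typed; some builds
    # are fussy about the numeric type, so cast if the call complains).
    opt_mgr.UpdateOptions(changedValue=[
        vim.option.OptionValue(key="Disk.UseDeviceReset", value=0),
        vim.option.OptionValue(key="Disk.UseLunReset", value=1),
    ])
finally:
    Disconnect(si)
```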

 

Peter J West
Frequent Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

So basically change UseDeviceReset to 0 for each of the ESX hosts?  UseLunReset should be left at its default of 1.

 

I'll give that a try.

 

Thanks

 

Pete

 

lando_uk
Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

Use smaller datastores and balance them across the cluster. If you just have one large VMFS volume then you're limiting yourself to 1 Gb/s for all of those VMs, whereas if you have 3 nodes at each site you can create 3 VMFS volumes so that each node holds a gateway, potentially giving your ESXi hosts 3 Gb/s to play with.

 

You still might see pauses and slowdowns when doing storage vmotions, but that's the nature of the beast.

Peter J West
Frequent Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

Thanks Lando,

 

In terms of storage throughput, though, I'd hope we're getting more than 1 Gb/s.  We have 3 NICs in a round-robin configuration, which means our theoretical limit for storage traffic is 3 Gb/s for each of the ESX servers?

 

At the moment the VMFS volume for operating system files is around 800GB, but I've checked back and noticed we've got these LSI_SAS errors occurring on servers going back as far as June 2011.  Back then we only had one or two servers running in the virtual environment.

 

Strangely, we're not seeing massive performance issues either at VM level or within the SAN, but we do have at least 2 hosts which seem to 'lock up' occasionally, and this is giving me cause for concern.

 

Thanks for the comments so far - they're really appreciated.

 

Kind regards

 

Pete

 

lando_uk
Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

Your LeftHand nodes have only 2 x 1 Gb/s NICs.  If your VMs are all on a single VMFS volume then the limit for that datastore is 1 Gb/s, even if each of your hosts is using 3 x 1 Gb/s NICs with round robin.  This is because your ESXi host can only read from and write to the P4500 node that your VMFS volume is located on (its gateway).  ESXi doesn't have a DSM driver like Windows does, so it can't read from all the nodes in the cluster at the same time.

 

Even though each node has 2 NICs, you can still only really utilise one when using Network RAID 10, as it uses the other NIC for grabbing the other chunks of data from the other nodes.

 

A simple test you can do to see this is to run an IOMeter sequential-read test from one VM in datastore-a. If only one VM is running in datastore-a you'll see about 100 MB/s of transfer.  Fire up another VM in the same datastore and run the same test at the same time, and you'll see both VMs run at about 50 MB/s; add a third VM and it'll drop to about 33 MB/s, and so on.
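
The maths behind those numbers is just the single gateway link being shared evenly. A trivial sketch of the arithmetic (100 MB/s is roughly the usable payload of one 1 Gb/s NIC, the figure from the test above):

```python
# Rough sketch: sequential throughput per VM when one 1 GbE gateway link is shared.
GATEWAY_MBPS = 100.0  # approx. usable MB/s over a single 1 Gb/s NIC

def per_vm_throughput(vm_count: int) -> float:
    """Naive even split of the gateway link across VMs on the same datastore."""
    return GATEWAY_MBPS / vm_count

for vms in (1, 2, 3):
    print(f"{vms} VM(s): ~{per_vm_throughput(vms):.0f} MB/s each")
# 1 VM(s): ~100 MB/s each
# 2 VM(s): ~50 MB/s each
# 3 VM(s): ~33 MB/s each
```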

 

So when you only have 1 datastore, a single VM within that datastore can potentially impact all the other VMs.

 

So if you have a 3-node cluster, create 3 datastores and divide your VMs up; separate out heavy-IO VMs like SQL servers and don't put them on the same datastore.

 

With multiple datastores on the same cluster there are other issues to consider (VAAI issues), but that is a whole other story!

 

All this might not fix your problem, but it's good practice and will improve performance.

 

The bottom line is that, to get decent performance from LeftHand and VMware, you have to use 10 Gb/s NICs.

 

 

Peter J West
Frequent Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

Hi Lando,

 

Thanks for explaining that to me - it makes a lot of sense.

 

Currently the LUN defined for operating system installs is around 800GB, and we have 14 virtual machines all using this single volume for their operating systems.

 

This clearly isn't good and your explanation gives me a clear understanding of what the problem is.

 

So, my plan is to start by creating 4 smaller volumes, each of 240GB.  We'll then have no more than 4 VMs using each volume for their operating systems, and at the same time we'll consider the storage footprint of each server.

 

We did look at the 10 Gb/s versions of the LeftHand solution, but the cost of 10 Gb/s switching gear was prohibitively expensive for us at the time the project was going ahead.

 

My understanding is that the current configuration should be fine, but we just need to slice the disk differently and not load all the OS installs onto a single volume.

 

Pete

 

Bart_Heungens
Honored Contributor

Re: LSI_SAS Errors. Related to P4000 SAN?

Hi,

 

Can you elaborate a little on the issues with VAAI? I have done tests with it and the results were really satisfying... I've never had issues with it... Deploying templates goes much faster than without VAAI enabled...

 

I agree completely about having more than one VMFS volume... If you enable load balancing correctly, you can go up to 2 Gb/s since it will spread the load across the 2 NICs... if you have multiple volumes and VMs, of course...

 

Kr,

Bart

--------------------------------------------------------------------------------
If my post was useful, click on my KUDOS! "White Star"!
My blog: http://blog.bitcon.be
Peter J West
Frequent Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

Stupid question time maybe?  What's VAAI?

 

Also, with regard to SAN throughput: within VMware we have a vSwitch defined just for SAN traffic, and I've pooled 3 vmnics to that switch.  The failover order for each port is overridden and an active adapter is manually specified.

 

My understanding was that this configuration would give us a combined throughput of 3 Gb/s plus fault tolerance; the failure of any one of the 3 NIC paths shouldn't cause the VMware host to lose network connectivity.
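
For reference, something like the following pyVmomi sketch can dump the vSwitch uplinks and the port group teaming overrides so the configuration can be sanity-checked. The hostname and credentials are placeholders; the same information is visible in the vSphere Client under Networking.

```python
# Illustrative sketch: dump vSwitch uplinks and port group NIC teaming overrides.
import ssl
from pyVim.connect import SmartConnect, Disconnect

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="esx01.example.com", user="root", pwd="password", sslContext=ctx)
try:
    host = si.RetrieveContent().searchIndex.FindByDnsName(
        dnsName="esx01.example.com", vmSearch=False)
    net = host.config.network
    for vsw in net.vswitch:
        print("vSwitch", vsw.name, "uplinks:", list(vsw.pnic or []))
    for pg in net.portgroup:
        team = pg.spec.policy.nicTeaming
        if team and team.nicOrder:
            print("Port group", pg.spec.name,
                  "active:", list(team.nicOrder.activeNic or []),
                  "standby:", list(team.nicOrder.standbyNic or []))
finally:
    Disconnect(si)
```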

 

I'm now wondering if my VMware network configuration is maybe incorrect.


Grateful for any pointers.

 

Cheers

 

Pete

 

Edit: I should also perhaps mention that I've noticed the SQL Servers we have in our VM environment are complaining about latency when accessing TempDB.  We're seeing event ID 833 being reported by SQL on a fairly regular basis.  Unfortunately one of these is a production system, which means I can't take the server down just now - but the plan is to move TempDB to another volume when a suitable window becomes available.  Currently TempDB is on C:, which still suggests we have some latency on the OS volumes.  I've not yet finished distributing the operating systems over the 4 new storage pools, so maybe this issue will die down once I have.

lando_uk
Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

The issue we have encountered with VAAI is kernel latency within ESXi when you're doing a Storage vMotion between volumes on the same cluster.

 

For example: a 2-node cluster with 2 volumes that are load balanced across the cluster.  Do a Storage vMotion from datastore1 to datastore2 - everything looks great from the outside and it whizzes along. But when you look closer, inside esxtop, you'll see bad kernel latency on the LUN that you're copying to. That latency affects all VMs that are also on that volume for the period the Storage vMotion is running; if you also take a snapshot of one of the other VMs, you'll probably see it pause for maybe 10 seconds and lose its heartbeat.

 

The cause of this latency is that VAAI operations will totally saturate one NIC on each node; when you look at the performance logs of each node, you'll see the NIC hit 99% utilisation. It seems that the ALB bonding HP uses doesn't work very well here, as it won't use all 4 NICs for VAAI.  This problem gets worse, with higher latency, when using more nodes in the cluster.

 

In all their latest documentation for P4000/ESXi 5.0 they mention that 10 Gb/s is recommended to reduce 'latency issues', which to me is a kind of admission that 1 Gb/s doesn't cut it in 2011 for virtualised workloads...

 

We are in the process of evaluating the 10 Gb/s option, as we couldn't get 1 Gb/s to offer decent performance. Initial tests at 10 Gb/s seem to have fixed all our issues (a snapshot of an 8GB VM took 10+ minutes at 1 Gb/s, reduced to under 2 minutes at 10 Gb/s).

Peter J West
Frequent Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

Despite spreading the VMs over multiple OS VMFS volumes we are still seeing a range of events in the event logs. These range from the LSI_SAS errors mentioned above to others pertaining to MPIO problems.

 

I've now opened a case with HP Support as I don't feel we're a particularly large environment and we shouldn't be seeing these issues.

 

I'll report back with more news on what we're doing just in case it helps anyone else solve their issues.

 

Pete

 

Edit: Just to clarify.  The new configuration is 5 x 250GB VMFS volumes and on each of these we have no more than 3 virtual servers.  Typically we allocate 60GB for the OS of each VM.

Peter J West
Frequent Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

Just a further update.

 

We had a live support session yesterday and as a result of this it was recommended that we turn on flow control on the storage nodes and switch ports.

 

The engineer doing the support session mentioned that our IOPS load on the SAN wasn't significantly high, but felt that enabling flow control might be a good starting point.

 

We've done that today and will monitor over the coming days.  I'm not that hopeful that it'll fix the issue but we'll see.

 

One thing I have noticed is that, when looking at the storage latency performance graph, there seem to be a lot of high values reported by vmhba33, which is a physical iSCSI port on the Broadcom card.  This struck me as a little strange, as I'd have expected the high-latency HBA to be vmhba37, which is the software-based iSCSI initiator.  Maybe this points to a problem with my configuration?

 

Aart Kenens
Advisor
Solution

Re: LSI_SAS Errors. Related to P4000 SAN?

Please use the software iSCSI initiator.

Performance is better and it has proven itself.

 

Some users on the VMware forums were complaining that iSCSI LUNs got disconnected under heavy load.

After they changed to swiscsi the problems were gone.

 

I too use the software iSCSI initiator (swiscsi) instead of the Broadcom hardware iSCSI.
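
If you want to script it, something like the pyVmomi sketch below can confirm the software initiator is enabled and list the iSCSI adapters so the software one (vmhba37 in your case) can be identified. The hostname and credentials are placeholders; the same thing is visible under Storage Adapters in the vSphere Client.

```python
# Illustrative sketch: ensure the software iSCSI initiator is enabled and list iSCSI HBAs.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="esx01.example.com", user="root", pwd="password", sslContext=ctx)
try:
    host = si.RetrieveContent().searchIndex.FindByDnsName(
        dnsName="esx01.example.com", vmSearch=False)
    storage = host.configManager.storageSystem

    # Turn the software iSCSI initiator on if it is not already enabled.
    if not storage.storageDeviceInfo.softwareInternetScsiEnabled:
        storage.UpdateSoftwareInternetScsiEnabled(True)

    # List the iSCSI adapters so the software one can be picked out.
    for hba in storage.storageDeviceInfo.hostBusAdapter:
        if isinstance(hba, vim.host.InternetScsiHba):
            kind = "software" if hba.isSoftwareBased else "hardware"
            print(hba.device, hba.model, kind)
finally:
    Disconnect(si)
```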

 

regards,

Peter J West
Frequent Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

Thanks Aart, I'll give that a try.

 

The max read and write latency figures for vmhba33 (which is the Broadcom) are 109,475 and 49,500 respectively. These figures dwarf those of the iSCSI software initiator (vmhba37), which has values of 91 and 89.

 

Fingers crossed this change will make some impact on the issue.

Peter J West
Frequent Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

Hi again,

 

It's a little early to say for sure yet, but at the moment the change to the iSCSI configuration appears to have fixed the problem.

 

I'm now seeing max latency values of 50ms for both read and write operations - I'm not sure how this fits with recommendations, but it does mean that all of the events in the event logs on the servers have vanished.

 

I'm going to continue monitoring for now, but it looks very much like we've got to the bottom of the issue.

 

Thanks

 

Pete

 

Aart Kenens
Advisor

Re: LSI_SAS Errors. Related to P4000 SAN?

I am glad you've sorted out the issue.

 

greetings,

 

Aart