Status = storage overloaded

PCrid · 01-07-2011

Dear all,

We've had a bit of a weird one today. We've had a couple of two-node P4300 G2 management groups/clusters for a few months and they've never given any trouble. Quite randomly today we received alert emails from the system saying:

Management Group: 'Lefthand_MG1'; Storage Server: 'lefthand1' Status = storage overloaded. Value does not equal up.

We received a few over the space of about 10 minutes, at which point we noticed that the volumes on the cluster/MG had started resyncing. They resynced successfully, and once they'd finished, just to be safe, we rebooted the errant node in case there was an underlying problem. Once again it resynced quickly and with zero issues.

There was nothing going on that would have placed the cluster under undue load, and we've never seen the error before. Obviously I'm happy the issue resolved itself with zero downtime on our application servers, but my question is: any idea what might have caused this, and should I be worried?

Regards,
Pete

BennyO · 01-07-2011

That is typically indicative of a cluster node processing too many IOPS, or possibly a drive failing on a node in the cluster. I just had a drive fail in one of my nodes today, and before the drive died completely I received a few "Overload" messages. If you start to receive a lot of them, there is definitely something wrong, as it is not normal. If you have a support contract, I would contact HP and have them review the health of the nodes and the cluster.

DeeDubNZ · 01-11-2011

I am also suffering with these errors. I am unable to reboot the affected nodes because a change freeze is in place at the moment. I will try this when I am able. Thanks for posting your result.

I have a case logged with HP support to look into this, as it affects multiple nodes at different times.

My environment has a total of 5 management nodes in a 19-node, 3-cluster management group. Any one of the five nodes can give this error and start resyncing.

If I find anything interesting I will let you know.

DeeDubNZ · 01-25-2011

I can confirm that this issue was caused by a failing/failed hard drive.

I am assuming that the drive had failed in such a way that it was causing errors on the SCSI bus.

The symptoms were very high latency on the storage node, up to 3 seconds at times, along with the consistent overloaded errors.
A RAID rebuild was triggered when it initially showed the disk issue (about 4 days after the storage server overload messages started). This rebuild process took 7 days longer than it does with a normal failed-disk event, that is, when a disk drops off and is instantly flagged as failed and so is removed straight away by the software.

In future I will be paying close attention to the latency monitors, as with this type of failure it takes a while before the disk is flagged as needing to be removed.
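
For anyone wanting to automate that kind of check, here is a minimal sketch. It assumes latency samples in milliseconds have already been exported from whatever monitoring is in place; the function name, the threshold and the window size are made up for illustration and are not anything from SAN/iQ. It flags sustained high latency rather than a single spike, since a single slow sample on its own may not mean much.

```python
from collections import deque


def sustained_high_latency(samples_ms, threshold_ms=100.0, window=5):
    """Return the indices at which the last `window` samples all exceeded threshold_ms.

    samples_ms   -- iterable of latency readings in milliseconds (assumed input)
    threshold_ms -- illustrative threshold, tune for your own environment
    window       -- number of consecutive readings that must be over the threshold
    """
    recent = deque(maxlen=window)  # sliding window of the most recent readings
    alerts = []
    for i, value in enumerate(samples_ms):
        recent.append(value)
        if len(recent) == window and all(v > threshold_ms for v in recent):
            alerts.append(i)
    return alerts


if __name__ == "__main__":
    # One isolated 3-second spike, then three slow readings in a row.
    readings = [10, 20, 3000, 15, 200, 250, 300, 12]
    print(sustained_high_latency(readings, threshold_ms=100, window=3))
```

With those sample readings, only the run of three consecutive slow readings triggers an alert; the isolated spike does not.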

Steve Burkett · 01-27-2011

I added a (HP Renew) P4500 G1 node to our existing 2-node cluster at the weekend and had nothing but trouble with it. The new node joined in fine, and ran perfectly up until the point that I had left the office and got on the train home. It then promptly started complaining about 'Storage System is Overloaded' before dying completely and going offline. I had to return to the office to hard power the thing off and back on; it was OK again for another 4 hours before doing the same thing. Next morning I went in to get it running again and double-checked that all the settings looked OK. It survived the four or five hours that I stayed around, and it wasn't until I had got home that it fell over again.

I logged a case with HP, who said the logs were showing the disk controller had suffered lockups and advised getting the firmware up to date. So I updated the disk backplane from 2.00 to 2.02, the P400 disk controller from 7.08 to 7.22, the disks from HPD4 to HPD6 and the motherboard BIOS up to the latest available. I also installed a couple of patches to SAN/iQ.

Thing lasted till the next afternoon before Overloading again, and then suffering a 2 disk offline/failed warning which was new. Disks came back online after power cycling and choosing F2 in the disk controller BIOS to accept the data loss. It did this a few more times that night.

At which point HP dispatched an engineer with a new disk array backplane and a new P400. The new P400 controller was DOA, so had to wait for a second replacement P400 to be couriered in. Finally got them installed and updated with the latest firmware and she's currently running ok!

Phew! Fingers crossed it stays like that.

Ivanbre · 02-02-2011

Steve,

Did you remove the node from the cluster when you did the F2? Did you do a placeholder check?

Read my blog about it:
http://blog.ivangole.com/?cat=9

You will get data corruption on your LUNs if you don't watch out.

Ivanbre · 02-02-2011

Sorry, wrong link.

http://blog.ivangole.com/?p=163

Use only as a last resort. Follow my steps there, otherwise you will get data corruption.

Steve Burkett · 02-03-2011

Can't say we did remove the node from the management group, no. Really wish Support would have pointed this out to be honest, they were quite vague as to how to recover the node, just saying we should delete Logical Drive 2 and recreate it to provide a logical drive which can be used for storing data again. Didn't give us any info on what 'logical drive 2' was referring to, or how we go about recreating it!

But saying that, haven't as yet seen anything bad happening (in fact it's been running well for about a week now) so maybe we dodged the bullet.

But thanks for the info, definitely know for next time now!

Ivanbre · 02-03-2011

Hi Steve,

An L3 engineer from the US helped me out with this problem. These guys were from the original LeftHand, and those guys rock.

As stated in the blog, all my VMs went down when I did that. I don't know your infrastructure, but it would be wise to recheck it. The problem is that the data on one node was newer than on the rebooted one. When I did an F2 and everything started syncing again I thought, OK, phew, but then everything was more or less destroyed. I had 60 VMs down in about 5 minutes. Then I did a storagestop on the node and removed it from the cluster, and I was lucky that 59 of the 60 VMs booted without a problem. So what happened? When you do an F2 you are basically saying "I accept data loss". But the node that is still running has newer data on it. Then when the rebooted node comes back online, you can imagine what happens... So basically the remove, make placeholder, restripe procedure saved me. This is the right way to handle it, as confirmed by the L3 engineer.
