Community Home > Storage > Midrange and Enterprise Storage > StoreVirtual Storage > P4300 brings store down after 1 disk fails
01-23-2012 01:38 AM
P4300 brings store down after 1 disk fails
Hello, first-time poster here with a strange issue.
Last week one disk in a RAID 5 set on one node (P4300) went to status degraded/failed; normally not a major problem as long as the disk is replaced ASAP.
But about 30 minutes after the disk failed, these entries appeared in the manager info log:
DBD_EVENT:POST:type=STORE_LATENCY_STATUS_EXCESSIVE [latency='61.175',threshold='60.000']
DBD_EVENT:POST:type=STORE_LATENCY_STATUS_NORMAL
DBD_EVENT:POST:type=STORE_LATENCY_STATUS_EXCESSIVE [latency='60.461',threshold='60.000']
DBD_MANAGER_HEARTBEAT:bringing store down after 25.567 secs (nheartbeat_failure=0)
DBD_MANAGER_HEARTBEAT:bringing store down after 25.576 secs (nheartbeat_failure=0)
The store was brought down, with all hell breaking loose after that: servers going down, etc.
Around that time, or a little after, the dbd_store reported that it was blocked for about 120 seconds.
After a short time the store came back online: offline --> degraded --> ready.
Something tells me this is not normal behaviour. We logged a case with HP support, obviously, but I was wondering if anybody has seen this issue before?
I should mention that this node is still on SAN/iQ 9.0... but even so...
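Reading the log excerpt, the manager appears to apply two checks: a latency status that flips to EXCESSIVE above a 60-second threshold, and a heartbeat timeout after which the store is brought down. A minimal sketch of that decision logic, with the caveat that the function names and the 25-second heartbeat timeout are assumptions inferred from the log, not actual SAN/iQ internals:

```python
# Hypothetical sketch of the watchdog behaviour suggested by the log above.
# Names and the heartbeat timeout are assumptions, not real SAN/iQ code.

LATENCY_THRESHOLD_SECS = 60.0  # matches threshold='60.000' in the log


def classify_latency(latency_secs: float) -> str:
    """Mirror the STORE_LATENCY_STATUS events from the manager info log."""
    if latency_secs > LATENCY_THRESHOLD_SECS:
        return "STORE_LATENCY_STATUS_EXCESSIVE"
    return "STORE_LATENCY_STATUS_NORMAL"


def should_bring_store_down(heartbeat_age_secs: float,
                            heartbeat_timeout_secs: float = 25.0) -> bool:
    """Take the store offline once heartbeats have gone unanswered too long.

    The 25s default is a guess based on the 'bringing store down after
    25.567 secs' entries in the log.
    """
    return heartbeat_age_secs > heartbeat_timeout_secs


# The log shows latencies of 61.175 and 60.461 against the 60.0 threshold,
# and the store being brought down after 25.567 secs:
assert classify_latency(61.175) == "STORE_LATENCY_STATUS_EXCESSIVE"
assert classify_latency(59.0) == "STORE_LATENCY_STATUS_NORMAL"
assert should_bring_store_down(25.567) is True
```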
01-23-2012 06:47 AM
Re: P4300 brings store down after 1 disk fails
Do you have a Failover Manager for that particular management group? We had several disk failures over the past few months, and right after two or three of them the affected node was offline for a minute or two. We have a setup with four nodes and a dedicated Failover Manager, so all volumes remained available. It seems the RAID controller sometimes takes a while to deal with a failed disk and stops responding to requests in time, so the SAN/iQ software takes the node offline due to excessive latency.
If you only have two nodes in your setup and no Failover Manager to provide quorum, the volumes will be unavailable for a short period of time. The same of course applies to any configuration with a single node.
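The quorum point above comes down to simple majority arithmetic. A hedged sketch (the function is illustrative, not part of any SAN/iQ API) of why a two-node group without a Failover Manager goes dark when one node stalls, while adding a FOM keeps volumes online:

```python
# Illustrative quorum check: volumes stay online only while a strict
# majority of managers in the management group is reachable.

def has_quorum(total_managers: int, reachable_managers: int) -> bool:
    """Return True if a strict majority of managers is still reachable."""
    return reachable_managers > total_managers // 2


# Two storage nodes, no FOM: one node's RAID controller stalls,
# leaving 1 of 2 managers -> no majority, volumes go offline.
assert has_quorum(total_managers=2, reachable_managers=1) is False

# Two nodes plus a dedicated FOM: losing one node still leaves
# 2 of 3 managers -> quorum holds, volumes stay available.
assert has_quorum(total_managers=3, reachable_managers=2) is True
```

A single-node group is the degenerate case: the only manager is also the one that stalls, so there is nothing left to hold quorum.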
01-23-2012 06:56 AM
Re: P4300 brings store down after 1 disk fails
Hi,
That's what I figured. I failed to mention that this customer has only one node in the MG, so there is no failover at all.
I didn't expect/know that there would be downtime for the store, though; that's a little strange for a storage node. Does that mean they should never be bought in a single-node setup?
There is a second node on order, but even then, as you mentioned, they definitely need a FOM. They were planning to use the new node in solo mode as well...
I'm glad you witnessed this behaviour as well... Are your nodes running 9.0 or 9.5?
Thanks again
01-23-2012 10:29 AM
Re: P4300 brings store down after 1 disk fails
We are running 9.0. I can only assume that this behaviour favours setups with more than one node, where there is no real danger in losing a node for a short period. Because Network RAID spreads all accesses over all nodes in a management group, a node with high latency will affect all volumes and all sessions, so the software decides to take the node offline to avoid clogging up the request queue. A sensible choice for setups with at least two nodes, fatal for setups without failover.