topic Re: MSA2212fc strange failure in Disk Enclosures

MSA2212fc strange failure

Ivan Kuznetsov — Tue, 19 Jan 2010 15:22:47 GMT

Hello

One of our clients has a two-node cluster (Oracle RAC on RHEL4) that uses an MSA2212fc disk array as shared storage and voting disk. The MSA has 2 controllers installed, and each controller is connected by an FC link to each node. We configured a RAID10 of 10 HDDs, with 2 more HDDs as global hot spares (12 dual-port SAS HDDs in total).
The cluster worked fine for ~1 year (24x7x365) but then failed once. Linux on both nodes showed that the MSA became inaccessible via both paths:

Jan 10 06:15:50 ctms1 kernel: SCSI error : <0 0 1 1> return code = 0x20000
Jan 10 06:15:50 ctms1 kernel: end_request: I/O error, dev sdb, sector 3936599
Jan 10 06:15:50 ctms1 kernel: device-mapper: dm-multipath: Failing path 8:16.
Jan 10 06:15:50 ctms1 multipathd: 8:16: mark as failed
Jan 10 06:15:50 ctms1 multipathd: mpath3: Entering recovery mode: max_retries=18
Jan 10 06:15:50 ctms1 multipathd: mpath3: remaining active paths: 0
Jan 10 06:15:51 ctms1 kernel: SCSI error : <0 0 1 1> return code = 0x20000
Jan 10 06:15:51 ctms1 kernel: end_request: I/O error, dev sdb, sector 145185607
Jan 10 06:16:01 ctms1 kernel: SCSI error : <0 0 1 1> return code = 0x20000
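
A note on reading the `return code = 0x20000` lines above: in Linux the SCSI result is a 32-bit word packed, from high to low, as driver byte, host byte, message byte, and status byte. So 0x20000 puts 0x02 in the host byte, which the kernel's SCSI headers name DID_BUS_BUSY, with a zero SCSI status byte. A minimal Python sketch of that decoding, assuming the log line format shown above and assuming the four numbers in angle brackets are host, channel, target, and LUN; the function names are hypothetical:

```python
import re

# Host-byte names as defined in the Linux kernel's SCSI headers.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}

# Assumed line format, taken from the excerpt in this thread.
SCSI_ERROR_RE = re.compile(
    r"SCSI error : <(\d+) (\d+) (\d+) (\d+)> return code = (0x[0-9a-fA-F]+)"
)


def decode_scsi_result(result):
    """Split a 32-bit SCSI result word into its four bytes."""
    host = (result >> 16) & 0xFF
    return {
        "status": result & 0xFF,        # SCSI status byte from the target
        "msg": (result >> 8) & 0xFF,    # message byte
        "host": host,                   # host adapter / transport level
        "driver": (result >> 24) & 0xFF,
        "host_name": HOST_BYTE_NAMES.get(host, "unknown"),
    }


def parse_scsi_error(line):
    """Return decoded fields for a 'SCSI error' line, or None for other lines."""
    match = SCSI_ERROR_RE.search(line)
    if match is None:
        return None
    host_no, channel, target, lun = (int(g) for g in match.groups()[:4])
    info = decode_scsi_result(int(match.group(5), 16))
    info.update(host_no=host_no, channel=channel, target=target, lun=lun)
    return info


if __name__ == "__main__":
    sample = ("Jan 10 06:15:50 ctms1 kernel: SCSI error : <0 0 1 1> "
              "return code = 0x20000")
    print(parse_scsi_error(sample))
```

A zero status byte with a non-zero host byte suggests the error was reported at the host adapter or transport level rather than as a SCSI status returned by the target, but it does not by itself identify the faulty component.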

The cluster tried to reboot itself, but both nodes hung on startup. It was early morning on a holiday, the load was minimal, and there were no on-duty technical personnel at the customer site. When the site administrator came in, he power-cycled the hardware and the cluster started successfully. No data was lost, but the application was offline for some hours.

The customer has asked us to diagnose the problem and prevent such failures in the future.

The MSA controller shows strange log records (see the attached file; the controller clock was not accurate, the difference is ~4 min). It looks as if all the HDDs failed simultaneously, but that is implausible: the array has two identical controllers, and all the drives are dual-ported.

Any ideas would help us greatly.

Regards, Ivan Kuznetsov
SOLVO ltd.

Re: MSA2212fc strange failure

Diego Salim de Oliveira — Tue, 19 Jan 2010 19:59:01 GMT

Hello Ivan,

For some reason I could not download your log files, so I have not read them.

But one thing that can cause errors on multiple disks at the same time is an environmental problem (for example, a cooling problem in the customer data center).

Hard disk drives are among the items most affected by this kind of problem, which may or may not result in data loss.

Do you have any log on the MSA or the servers that indicates a cooling problem (excessive heat)?
Do you have any information about problems in the customer data center?

Re: MSA2212fc strange failure

Ivan Kuznetsov — Tue, 19 Jan 2010 21:38:25 GMT

Hello!

The temperature in the data room was normal. Both cluster nodes are HP DL360s, which have good environmental monitoring too. The customer has a number of other servers and data hardware in the same room, and none of them raised an alarm.

Power comes from two independent UPSes. Both are monitored and in good condition.

The HDDs had not actually failed; it only looked that way. After the MSA was manually switched off and on, it started successfully, and all 12 drives came up and ran without errors.

Most probably the problem was a malfunction of some part of the MSA that is common to all HDDs but is not monitored. But the MSA hardware has 2x redundancy... I do not believe that two or more circuits failed at the same time. There should be exactly one cause.

Regards, Ivan