01-19-2010 07:22 AM
MSA2212fc strange failure
Hello
One of our clients has a two-node cluster (Oracle RAC under RHEL4) that uses an MSA2212fc disk array as shared storage and voting disk. The MSA has 2 controllers installed. Each controller is connected by an FC link to each node. We configured a RAID10 of 10 HDDs, with 2 more HDDs as global hot spares (12 dual-port SAS HDDs in total).
The cluster worked fine for ~1 year (24x7x365) but then failed once. Linux on both nodes showed that the MSA became inaccessible via both paths:
Jan 10 06:15:50 ctms1 kernel: SCSI error : <0 0 1 1> return code = 0x20000
Jan 10 06:15:50 ctms1 kernel: end_request: I/O error, dev sdb, sector 3936599
Jan 10 06:15:50 ctms1 kernel: device-mapper: dm-multipath: Failing path 8:16.
Jan 10 06:15:50 ctms1 multipathd: 8:16: mark as failed
Jan 10 06:15:50 ctms1 multipathd: mpath3: Entering recovery mode: max_retries=18
Jan 10 06:15:50 ctms1 multipathd: mpath3: remaining active paths: 0
Jan 10 06:15:51 ctms1 kernel: SCSI error : <0 0 1 1> return code = 0x20000
Jan 10 06:15:51 ctms1 kernel: end_request: I/O error, dev sdb, sector 145185607
Jan 10 06:16:01 ctms1 kernel: SCSI error : <0 0 1 1> return code = 0x20000
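For readers following along, here is a minimal Python sketch of how the `return code = 0x20000` value in these lines can be unpacked. It assumes the usual Linux SCSI result layout (status byte in bits 0-7, message byte in bits 8-15, host byte in bits 16-23, driver byte in bits 24-31) and the host-byte names from the 2.6-era kernel headers; the function names are made up for illustration. Under those assumptions the host byte is 0x02, `DID_BUS_BUSY`, which points at the transport or target side not responding rather than at a medium error on one disk. It does not by itself identify the root cause.

```python
import re

# Host-byte codes as named in 2.6-era Linux kernel SCSI headers (assumed layout).
HOST_BYTE = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
    0x0A: "DID_PASSTHROUGH",
    0x0B: "DID_SOFT_ERROR",
}

# Matches lines such as:
#   kernel: SCSI error : <0 0 1 1> return code = 0x20000
# The four numbers are host, channel, id and lun.
SCSI_ERROR_RE = re.compile(
    r"SCSI error : <(\d+) (\d+) (\d+) (\d+)> return code = (0x[0-9a-fA-F]+)"
)


def decode_scsi_result(result):
    """Split a 32-bit SCSI result into its four bytes and name the host byte."""
    host = (result >> 16) & 0xFF
    return {
        "status": result & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "host": host,
        "driver": (result >> 24) & 0xFF,
        "host_name": HOST_BYTE.get(host, "unknown"),
    }


def parse_scsi_error(line):
    """Return (host, channel, id, lun, decoded result), or None if no match."""
    m = SCSI_ERROR_RE.search(line)
    if m is None:
        return None
    host, channel, target, lun = (int(m.group(i)) for i in range(1, 5))
    return host, channel, target, lun, decode_scsi_result(int(m.group(5), 16))


if __name__ == "__main__":
    sample = ("Jan 10 06:15:50 ctms1 kernel: SCSI error : "
              "<0 0 1 1> return code = 0x20000")
    print(parse_scsi_error(sample))
```

Running the same parser over the full syslog of both nodes would show whether every failing device reports the same host byte, which helps separate a fabric or controller problem from individual disk errors.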
The cluster nodes tried to reboot themselves, but both hung on startup. It was early morning on a holiday, the load was minimal, and there were no on-duty technical personnel at the customer site. When the site administrator arrived, he power-cycled the hardware and the cluster started successfully. No data was lost, but the application was offline for several hours.
The customer has asked us to diagnose the problem and prevent such failures in the future.
The MSA controller shows strange log records (see the attached file; the controller clock was not accurate, the difference is ~4 min). It looks as if all the HDDs failed simultaneously, but that is implausible: the array has two identical controllers, and all the drives are dual-ported.
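Because the controller clock and the host clocks differ by about 4 minutes, controller events have to be shifted before they can be matched against the syslog lines. The sketch below shows one way to do that in Python. Everything specific in it is an assumption: the direction of the skew (the post gives only its size), the year 2010 (syslog lines carry no year), and the controller events themselves, which are placeholders because the attached log is not reproduced here.

```python
from datetime import datetime, timedelta


def events_near(host_event, controller_events, skew, window=timedelta(seconds=60)):
    """Return controller events that fall within `window` of a host event.

    `skew` is added to each controller timestamp to express it on the
    host clock (host_time = controller_time + skew).
    """
    matches = []
    for ts, text in controller_events:
        aligned = ts + skew
        if abs(aligned - host_event) <= window:
            matches.append((aligned, text))
    return matches


if __name__ == "__main__":
    # First multipath failure on node ctms1 (year assumed).
    host_event = datetime(2010, 1, 10, 6, 15, 50)

    # Placeholder controller events; NOT taken from the real MSA log.
    controller_events = [
        (datetime(2010, 1, 10, 5, 0, 0), "hypothetical event B"),
        (datetime(2010, 1, 10, 6, 11, 55), "hypothetical event A"),
    ]

    # Assumption: the controller clock runs 4 minutes behind the hosts.
    # If it runs ahead instead, use timedelta(minutes=-4).
    skew = timedelta(minutes=4)

    for aligned, text in events_near(host_event, controller_events, skew):
        print(aligned.isoformat(), text)
```

Trying both signs of the skew and seeing which one lines controller events up with the 06:15:50 path failures is a quick way to settle the direction.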
Any ideas would help us greatly.
Regards, Ivan Kuznetsov
SOLVO ltd.
2 REPLIES
01-19-2010 11:59 AM
Re: MSA2212fc strange failure
Hello Ivan,
For some reason I could not download your log files, so I have not read them.
However, one thing that can cause errors on multiple disks at the same time is an environmental problem, such as a cooling failure in the customer's data center.
Hard disk drives are among the components most affected by this kind of problem, which may or may not result in data loss.
Do you have any logs on the MSA or the servers that indicate a cooling problem (excessive heat)?
Do you have any information about problems in the customer's data center?
01-19-2010 01:38 PM
Re: MSA2212fc strange failure
Hello!
The temperature in the data room was normal. Both cluster nodes are HP DL360s, which have good environmental monitoring too. The customer has a number of other servers and data hardware in the same room, and none of them raised an alarm.
Power comes from two independent UPSes. Both are monitored and in good condition.
The HDDs had not actually failed; it only appeared so. After being manually switched off and on, the MSA started successfully, and all 12 drives came up and ran without errors.
Most probably the problem was a malfunction of some part of the MSA that is common to all the HDDs but is not monitored. But the MSA hardware has 2x redundancy... I do not believe that two or more circuits failed at the same time. There should be exactly one cause.
Regards, Ivan
The opinions expressed above are the personal opinions of the authors, not of Hewlett Packard Enterprise. By using this site, you accept the Terms of Use and Rules of Participation.
© Copyright 2024 Hewlett Packard Enterprise Development LP