Hello everybody,
We are facing a very strange issue with 2 of our HP DL 380 servers. The two servers are built identically (same hardware, same OS). Each server has an HP Smart Array 5i
Controller with 2 logical drives: one RAID 1+0 composed of 2 disks of 72.82 GB and one RAID 0 composed of 2 disks of 148.6 GB.
For reasons unknown, we have been getting repeated I/O errors on the RAID 0 of each of these 2 servers for several months. We have tried several approaches to fix the problem, alas without success:
* As the disks were shown as failed (red blinking), we first suspected a hardware problem with the disks themselves. We replaced the failed disks with brand-new disks several times, but the error keeps occurring.
* We next thought that it could be a problem with the controller. Our HP technician ran various hardware diagnostic tools, but no error could be found on the servers.
* We upgraded the firmware of the RAID controller. This gave us a 2-month break, but the disk errors occurred again yesterday.
When the disk fails, we see the following entries in messages.log:
kernel: cciss: cmd f70c90b4 has CHECK CONDITION byte 2
kernel: Buffer I/O error on device cciss/c0d1p1, logical block 191236552
kernel: lost page write due to I/O error on cciss/c0d1p1
kernel: Buffer I/O error on device cciss/c0d1p1, logical block 191236553
...
These messages go on until the file-system layer reports an error, and finally the disk becomes unusable.
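To see how many blocks are affected and whether the failing blocks are contiguous, the "Buffer I/O error" lines can be tallied per device. The following is only a minimal Python sketch, assuming the log lines have exactly the format shown above; the function name summarize_io_errors and the sample list are made up for illustration and are not part of any HP or kernel tool.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   kernel: Buffer I/O error on device cciss/c0d1p1, logical block 191236552
# (format assumed from the log excerpt in this post)
BUFFER_ERR = re.compile(r"Buffer I/O error on device (\S+), logical block (\d+)")


def summarize_io_errors(lines):
    """Return {device: (error_count, lowest_block, highest_block)}."""
    blocks = defaultdict(list)
    for line in lines:
        match = BUFFER_ERR.search(line)
        if match:
            blocks[match.group(1)].append(int(match.group(2)))
    return {dev: (len(b), min(b), max(b)) for dev, b in blocks.items()}


# Sample input: the four log lines quoted in this post.
sample = [
    "kernel: cciss: cmd f70c90b4 has CHECK CONDITION byte 2",
    "kernel: Buffer I/O error on device cciss/c0d1p1, logical block 191236552",
    "kernel: lost page write due to I/O error on cciss/c0d1p1",
    "kernel: Buffer I/O error on device cciss/c0d1p1, logical block 191236553",
]

for device, (count, low, high) in summarize_io_errors(sample).items():
    print(f"{device}: {count} buffer I/O errors, logical blocks {low}..{high}")
```

Any iterable of lines can be passed in, so an open file handle on the full messages.log works the same way as the sample list.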
What is also interesting is that the problem seems to be fixable just by rebooting the server and re-creating the RAID 0 (erase + create) with the Smart Array utility.
We are quite puzzled by this issue. I am wondering whether any of you have already faced similar problems, and how you managed to solve them. Please find below additional information about the servers; the relevant messages.log entries are attached.
Thanks in advance,
Loic.
Linux version / distribution
----------------------------
SuSE Linux 9.2 (i586)
VERSION = 9.2
Linux mucinj01 2.6.8-24-smp #1 SMP Wed Oct 6 09:16:23 UTC 2004 i686 i686 i386 GNU/Linux
Array Controller
----------------
cciss0: HP Smart Array 5i Controller
Board ID: 0x40800e11
Firmware Version: 2.66
IRQ: 185
Logical drives: 2
Current Q depth: 0
Current # commands on controller: 0
Max Q depth since init: 304
Max # commands on controller since init: 384
Max SG entries since init: 31
Sequential access devices: 0
cciss/c0d0: 72.82GB RAID 5
cciss/c0d1: 293.61GB RAID 0