
repeating cciss error on HP DL 380 with SuSE 9.2 professional

 
Loic Domaigne
Occasional Advisor

Hello everybody,

We are facing a very strange issue with 2 of our HP DL 380 servers. The two servers are built identically (same hardware, same OS). These servers have an HP Smart Array 5i Controller with 2 logical drives: 1 RAID 1+0 composed of 2 disks of 72.82 GB and a RAID 0 composed of 2 disks of 148.6 GB.

For some unknown reason, we have been getting repeated I/O errors on the RAID 0 on each of these 2 servers for several months. We tried several approaches to fix the problem, alas without any success:

* As the disks were shown as failed (red blinking), we first thought of a hardware problem with the disks themselves. We replaced the failed disks with brand new disks several times, but the error keeps occurring.

* We next thought that it could be a problem with the controller. Our HP technician ran various hardware diagnostic tools, but no error could be found on the servers.

* We upgraded the firmware of the RAID controller. It gave us a 2-month break, but these disk errors occurred again yesterday.

When the disk fails, we see the following entries in messages.log:

kernel: cciss: cmd f70c90b4 has CHECK CONDITION byte 2
kernel: Buffer I/O error on device cciss/c0d1p1, logical block 191236552
kernel: lost page write due to I/O error on cciss/c0d1p1
kernel: Buffer I/O error on device cciss/c0d1p1, logical block 191236553
...

These messages go on until the file-system layer reports an error, and finally the disk becomes unusable.
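To see how many blocks are affected and on which device before the file system gives up, the log can be summarized with a small script. This is only a rough sketch: the two patterns are modelled on the lines quoted above and may need adjusting for other kernel versions, and the function name `summarize_cciss_errors` is made up for this example.

```python
import re
from collections import defaultdict

# Patterns modelled on the log lines quoted in this post (an assumption:
# other kernel versions may word these messages differently).
BUFFER_ERR = re.compile(r"Buffer I/O error on device ([^,\s]+), logical block (\d+)")
CHECK_COND = re.compile(r"cciss: cmd \S+ has CHECK CONDITION")


def summarize_cciss_errors(lines):
    """Count CHECK CONDITION lines and collect the failed blocks per device."""
    check_conditions = 0
    blocks = defaultdict(list)
    for line in lines:
        if CHECK_COND.search(line):
            check_conditions += 1
        match = BUFFER_ERR.search(line)
        if match:
            blocks[match.group(1)].append(int(match.group(2)))
    summary = {
        device: {
            "errors": len(found),
            "first_block": min(found),
            "last_block": max(found),
        }
        for device, found in blocks.items()
    }
    return check_conditions, summary


# The sample is the excerpt from this post.
sample = [
    "kernel: cciss: cmd f70c90b4 has CHECK CONDITION byte 2",
    "kernel: Buffer I/O error on device cciss/c0d1p1, logical block 191236552",
    "kernel: lost page write due to I/O error on cciss/c0d1p1",
    "kernel: Buffer I/O error on device cciss/c0d1p1, logical block 191236553",
]

check_conditions, summary = summarize_cciss_errors(sample)
print(check_conditions, summary)
```

On a real server the same function could be fed the lines of the system log file instead of `sample`; the location of that file depends on the syslog configuration.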

What's also interesting about this problem is that it seems it can be fixed just by rebooting the server and re-creating the RAID 0 (erase + create) with the Smart Array utility.

We are quite puzzled by this issue. I am wondering if some of you have already faced similar problems, and how you managed to solve them. Please find below additional information about the servers, as well as the relevant messages.log entries in the attachment.

Thanks in advance,
Loic.


Linux version / distribution
----------------------------
SuSE Linux 9.2 (i586)
VERSION = 9.2
Linux mucinj01 2.6.8-24-smp #1 SMP Wed Oct 6 09:16:23 UTC 2004 i686 i686 i386 GNU/Linux

Array Controller
----------------
cciss0: HP Smart Array 5i Controller
Board ID: 0x40800e11
Firmware Version: 2.66
IRQ: 185
Logical drives: 2
Current Q depth: 0
Current # commands on controller: 0
Max Q depth since init: 304
Max # commands on controller since init: 384
Max SG entries since init: 31
Sequential access devices: 0

cciss/c0d0: 72.82GB RAID 5
cciss/c0d1: 293.61GB RAID 0
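
Since the two servers are supposed to be identical, one quick sanity check is to compare this controller status between them field by field. The sketch below parses the "Key: value" lines into a dictionary and reports the differences. The helper names are invented for this example, and the second server's firmware value (2.58) is made up purely to show what a difference would look like; it is not taken from our machines.

```python
def parse_status(text):
    """Turn 'Key: value' lines into a dict, skipping headings and blank lines."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def diff_status(a, b):
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Status of the first server, shortened from the listing above.
server_a = """cciss0: HP Smart Array 5i Controller
Board ID: 0x40800e11
Firmware Version: 2.66
Logical drives: 2
cciss/c0d0: 72.82GB RAID 5
cciss/c0d1: 293.61GB RAID 0"""

# Hypothetical second server: same text with an invented firmware version.
server_b = server_a.replace("2.66", "2.58")

differences = diff_status(parse_status(server_a), parse_status(server_b))
print(differences)
```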
7 REPLIES
Sandeepk_1
Advisor

Re: repeating cciss error on HP DL 380 with SuSE 9.2 professional

SLES 9 with kernel 252 is up to date and should support your 5i controller very well. 252 is the latest errata kernel of SP3.

The driver HP provides is cpq_cciss-2.6.10-11.sles9.i586.rpm, and it supports the kernels HP lists for it:

2.6.5-7.97
2.6.5-7.104
2.6.5-7.108
2.6.5-7.111.5
2.6.5-7.139
2.6.5-7.145
2.6.5-7.147
2.6.5-7.151
2.6.5-7.155.29
2.6.5-7.191 - (x86)

252 has more enhancements and bug fixes, so you can leave this driver alone.
If you installed the HP ProLiant Support Pack, which contains hpasm, hprsm, etc., then of course the driver mentioned above is also included in that pack and will be installed on a Linux system with a supported kernel.
You can check your cciss information by running rpm -qa|grep cciss, or go to /proc/scsi/cciss/X and use lspci for further verification.
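
A quick way to check whether a given kernel release appears in the list above is sketched below. It is only an illustration: the helper names are invented, stripping a flavour suffix such as "-smp" before comparing is an assumption about how the release strings relate to the listed versions, and it says nothing about how the driver package itself decides what to install.

```python
import platform

# Kernel versions copied from the list in this reply.
LISTED_KERNELS = {
    "2.6.5-7.97", "2.6.5-7.104", "2.6.5-7.108", "2.6.5-7.111.5",
    "2.6.5-7.139", "2.6.5-7.145", "2.6.5-7.147", "2.6.5-7.151",
    "2.6.5-7.155.29", "2.6.5-7.191",
}

# Flavour suffixes to ignore when comparing (assumed naming).
FLAVOUR_SUFFIXES = ("-bigsmp", "-smp", "-default")


def base_release(release):
    """Strip a trailing flavour suffix, e.g. '2.6.8-24-smp' -> '2.6.8-24'."""
    for suffix in FLAVOUR_SUFFIXES:
        if release.endswith(suffix):
            return release[: -len(suffix)]
    return release


def is_listed(release):
    """True if the release, without its flavour suffix, is in the list."""
    return base_release(release) in LISTED_KERNELS


# The kernel from the original post is a SuSE 9.2 kernel, not a SLES 9 one.
print(is_listed("2.6.8-24-smp"))
# Check the machine this script runs on.
print(platform.release(), is_listed(platform.release()))
```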
Colin Topliss
Esteemed Contributor

Re: repeating cciss error on HP DL 380 with SuSE 9.2 professional

Hi,

Had a similar problem on a DL380 (running SuSE 9.0). Worked fine for days, then we'd get very similar errors before the filesystem reported I/O errors. Reboot, get the filesystem fsck'd and back, and it would be fine for a while before it would go again.

We ended up having the controller replaced (went through the disk replacement and firmware upgrade too). Since then, no problems at all....
Priyank Patel
New Member

Re: repeating cciss error on HP DL 380 with SuSE 9.2 professional

We are seeing similar errors on our HP ProLiant servers.

Did the suggested solution of replacing the controllers rather than the drives work well for you?

Thanks,
Juan Pablo Venegas
New Member

Re: repeating cciss error on HP DL 380 with SuSE 9.2 professional

Hi...

I have just had the same problem with CentOS 4.5.
We had been working just fine with the server for 3 months, but suddenly the problem started with a disk failure on the RAID 5. We replaced the disk and it worked for a day; the next day 2 disks failed, the new replacement and another one. We restarted and it worked again, but then it finally failed, so we rebuilt the RAID and restored the backup, and we worked just fine for 1 more day. Now we are trying to find out what is wrong with the server.
Does anybody have any idea what could be causing this problem? It appears to be a hardware failure. We are removing the server now.
Rob Leadbeater
Honored Contributor

Re: repeating cciss error on HP DL 380 with SuSE 9.2 professional

Hi Juan Pablo,

Whilst your problem may be similar, it would be good if you could start your own thread, rather than continuing a 6-month-old one.

That way if someone posts a solution to *your* problem, you can assign relevant points etc.

By all means reference this thread in your post.

Cheers,

Rob
Colin Topliss
Esteemed Contributor

Re: repeating cciss error on HP DL 380 with SuSE 9.2 professional

...to finish off what I started in the other post, yes - replacing the controller rather than the disks solved my particular problem.

What I can't tell you though is what firmware rev the controller was on at the time (so maybe replacing the controller with a 'new' one left me with a different rev). Try the latest firmware from the PSP before getting the controller replaced (HP will tell you to update the firmware anyway - standard initial reply)!

Colin.
Loic Domaigne
Occasional Advisor

Re: repeating cciss error on HP DL 380 with SuSE 9.2 professional

Time to close this thread and give points.