
Disk write errors

 
dawn_jose85
Frequent Advisor

Disk write errors

Hi,
I have an issue with my HP ProLiant server: it is showing disk write errors.
I'm attaching the logs.
The OS is RHEL 5.
What is the cause of this issue, and how can I fix it?
Matti_Kurkela
Honored Contributor

Re: Disk write errors

Looks like you're getting errors from more than one logical disk (cciss/c1d0p1,
cciss/c1d1p1, cciss/c1d2p1 and cciss/c1d3p1).
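
To see at a glance which logical disks are affected, the kernel messages can be tallied per device. The following is only a rough sketch: it assumes the log has been copied to a machine with Python 3, and the function name `count_device_errors` and the keyword filter ("error" or "abort") are my own choices, not anything the cciss driver defines. The sample lines are taken from the logs quoted in this thread.

```python
import re
from collections import Counter

# Matches device names such as cciss/c1d2p1 in kernel log lines.
DEVICE_RE = re.compile(r"cciss/c\d+d\d+(?:p\d+)?")

def count_device_errors(lines):
    """Count log lines that mention a cciss device together with an error keyword."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        # Assumption: lines containing "error" or "abort" are the interesting ones.
        if "error" not in lowered and "abort" not in lowered:
            continue
        # Count each device at most once per line.
        for dev in set(DEVICE_RE.findall(line)):
            counts[dev] += 1
    return counts

# Sample lines copied from the logs posted in this thread.
SAMPLE = [
    "May 3 02:31:33 localhost kernel: Aborting journal on device cciss/c1d2p1.",
    "May 3 02:31:33 localhost kernel: EXT3-fs error (device cciss/c1d2p1): "
    "ext3_journal_start_sb: Detected aborted journal",
    "May 2 12:33:08 localhost kernel: Buffer I/O error on device cciss/c1d1p1, "
    "logical block 5557037",
    "May 2 12:33:10 localhost kernel: cciss: cmd ffff810037e80270 is reported invalid",
]

for dev, n in count_device_errors(SAMPLE).most_common():
    print(f"{dev}: {n}")
```

To run it against a real log, pass an open file object (for example a copy of /var/log/messages) to `count_device_errors` instead of `SAMPLE`. If several logical disks show up, that points towards a shared component rather than a single drive.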

Is this related to your earlier post about hpacucli?

Because filesystem errors have been detected, you will probably have to run a full filesystem check on the filesystems on those disks after the root cause is fixed. A reboot might do that automatically.

Is there a common element that would affect all the physical disks corresponding to those logical disks?

For example, if all the corresponding physical disks are in an external enclosure, you should check the health of the enclosure: are all the cables connected, power supplies OK/not OK, etc.

Try to reboot the system and pay attention to the BIOS boot messages: the SmartArray controller might print out informative error messages.

Also check the firmware versions. If you're not running the latest firmware, read the version history of the firmware package for your SmartArray controller model, going backwards from the latest version until you reach the version you're running now. If the version history lists important fixes that seem relevant to your current issue, consider updating the SmartArray firmware.


I once had a DL380 G5 with a SmartArray P800 and an external MSA50 enclosure. At boot time, the SmartArray displayed these messages:
-----
1777-Slot 4 Drive Array - Storage Enclosure Problem Detected
Port 1E: Box 1: Enclosure Processor Not Detected or Responding
Turn system and storage enclosure power OFF and turn them back ON to retry. If this error persists, upgrade the enclosure firmware or replace the I/O module.

1784-Slot 4 Drive Array - Drive Failure
The following disk drive(s) are failed and should be replaced:
Missing Port/Box 1: Bays 1,2,3,4,5,6,7,8,9,10
On-Line Spare Drive Failed
-----

In this case, all the disks of the external MSA50 enclosure were "failed" because of a problem in the enclosure itself. A visual inspection of the I/O module revealed the cause. See the attached picture of the I/O module and note the burned component in the foreground.

After the I/O module was replaced, it turned out the disks were OK. A full filesystem check was still required, because the I/O module had failed in mid-operation.

MK
dawn_jose85
Frequent Advisor

Re: Disk write errors

Hi, thank you for your help.
This query is not about the hpacucli issue; it concerns a different server.
This server is directly connected to an MSA20 storage enclosure through a SCSI cable, but no fault has been detected on any of the hard drives in the enclosure. Could the problem be with the storage controller?
The server is also showing the errors below:

May 3 02:31:33 localhost kernel: Aborting journal on device cciss/c1d2p1.
May 3 02:31:33 localhost kernel: ext3_abort called.
May 3 02:31:33 localhost kernel: EXT3-fs error (device cciss/c1d2p1): ext3_journal_start_sb: Detected aborted journal
May 3 02:31:33 localhost kernel: Remounting filesystem read-only
May 3 02:31:33 localhost kernel: cciss: cmd ffff810037e87290 is reported invalid
May 3 02:31:33 localhost kernel: cciss: cmd ffff810037e87500 is reported invalid

Also

May 2 12:33:09 localhost kernel: EXT3-fs error (device cciss/c1d2p1) in ext3_ordered_commit_write: IO failure
May 2 12:33:09 localhost kernel: cciss: cmd ffff810037e80000 has CHECK CONDITION byte 2 = 0x5
May 2 12:33:09 localhost last message repeated 2 times
May 2 12:33:09 localhost kernel: ext3_abort called.
May 2 12:33:09 localhost kernel: EXT3-fs error (device cciss/c1d1p1): ext3_journal_start_sb: Detected aborted journal
May 2 12:33:09 localhost kernel: Remounting filesystem read-only
May 2 12:33:10 localhost kernel: cciss: cmd ffff810037e80270 is reported invalid
May 2 12:33:10 localhost kernel: cciss: cmd ffff810037e804e0 is reported invalid
May 2 12:33:11 localhost kernel: cciss: cmd ffff810037e80750 is reported invalid
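
As a reading aid for the "CHECK CONDITION byte 2 = 0x5" line above: assuming the driver is printing byte 2 of fixed-format SCSI sense data, the low four bits of that byte are the sense key, and key 0x5 is ILLEGAL REQUEST. The sketch below decodes that value from a log line. The helper name `decode_sense_key` is hypothetical, and the table covers only the commonly used SCSI sense keys.

```python
import re

# Commonly used SCSI sense keys (low nibble of sense byte 2, fixed format).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0x8: "BLANK CHECK",
    0x9: "VENDOR SPECIFIC",
    0xA: "COPY ABORTED",
    0xB: "ABORTED COMMAND",
    0xD: "VOLUME OVERFLOW",
    0xE: "MISCOMPARE",
}

CHECK_RE = re.compile(r"CHECK CONDITION byte 2 = 0x([0-9a-fA-F]+)")

def decode_sense_key(line):
    """Return the sense key name for a cciss CHECK CONDITION line, or None."""
    match = CHECK_RE.search(line)
    if match is None:
        return None
    byte2 = int(match.group(1), 16)
    # Assumption: the value is sense byte 2, whose low nibble is the sense key.
    return SENSE_KEYS.get(byte2 & 0x0F, "RESERVED/OBSOLETE")

line = ("May 2 12:33:09 localhost kernel: cciss: cmd ffff810037e80000 "
        "has CHECK CONDITION byte 2 = 0x5")
print(decode_sense_key(line))
```

The sense key alone does not identify the failing component; it only tells you how the target classified the rejected command.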
dawn_jose85
Frequent Advisor

Re: Disk write errors

Hi,

I'm also getting the errors below:

May 2 11:27:14 localhost auditd[3157]: Audit daemon rotating log files
May 2 12:33:08 localhost kernel: cciss: cmd ffff810037e926f0 is reported invalid
May 2 12:33:08 localhost kernel: Buffer I/O error on device cciss/c1d1p1, logical block 5557037

And

May 19 11:28:45 localhost kernel: audit: audit_backlog=321 > audit_backlog_limit=320
May 19 11:28:45 localhost kernel: audit: audit_lost=2 audit_rate_limit=0 audit_backlog_limit=320
May 19 11:28:45 localhost kernel: audit: backlog limit exceeded

dawn_jose85
Frequent Advisor

Re: Disk write errors

Hi,
How can I check the Smart Array firmware version?
Matti_Kurkela
Honored Contributor

Re: Disk write errors

The SmartArray firmware version is displayed among the BIOS messages when booting the system. You can also see it with hpacucli:

# hpacucli
> controller all show config detail

[...all the information about the SmartArray you can hope for...]

MK
Matti_Kurkela
Honored Contributor

Re: Disk write errors

(Oops, I hit Submit too soon...)

The audit error messages probably appear because the system cannot write the audit logs to disk: your filesystems have switched to read-only mode (as indicated by the previous errors) following the SmartArray errors.
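
If you want to confirm which filesystems the kernel currently treats as read-only, one option is to look at /proc/mounts, which reflects the kernel's view of the mount options. Here is a minimal sketch, assuming the usual six-field /proc/mounts layout (device, mount point, type, options, dump, pass). The function name `read_only_mounts` and the mount points in the sample text are made up for illustration and are not taken from your system.

```python
def read_only_mounts(mount_lines, fstypes=("ext3",)):
    """Return (device, mountpoint) pairs for read-only mounts of the given types."""
    result = []
    for line in mount_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        # Options are comma-separated; "ro" marks a read-only mount.
        if fstype in fstypes and "ro" in options.split(","):
            result.append((device, mountpoint))
    return result

# Hypothetical /proc/mounts content; the mount points are illustrative only.
SAMPLE_MOUNTS = """\
/dev/cciss/c0d0p1 / ext3 rw,data=ordered 0 0
/dev/cciss/c1d1p1 /data1 ext3 ro,data=ordered 0 0
/dev/cciss/c1d2p1 /data2 ext3 ro,data=ordered 0 0
proc /proc proc rw 0 0
""".splitlines()

for device, mountpoint in read_only_mounts(SAMPLE_MOUNTS):
    print(f"{device} mounted read-only on {mountpoint}")
```

On the server itself you would pass the lines of /proc/mounts instead of `SAMPLE_MOUNTS`. Any filesystem listed there will need a filesystem check once the underlying storage problem is resolved.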

Yes, the failure of the controller is a possibility. Without knowing more about the controller's current state (model, firmware level, configuration, disks OK/NotOK) it's hard to say for sure.

MK