EXT3-fs error (device dm-6)
08-06-2010 04:00 AM
Just replaced the bad disk yesterday and I am still getting errors. I see the following errors in dmesg, and one of the filesystems is still mounted read-only. Any ideas?
sd 0:0:0:0: SCSI error: return code = 0x08000002
sda: Current: sense key: Hardware Error
Add. Sense: Internal target failure
Info fld=0x0
end_request: I/O error, dev sda, sector 235753109
Buffer I/O error on device dm-6, logical block 2704297
lost page write due to I/O error on dm-6
Buffer I/O error on device dm-6, logical block 2704298
lost page write due to I/O error on dm-6
sd 0:0:0:0: SCSI error: return code = 0x08000002
sda: Current: sense key: Hardware Error
Add. Sense: Internal target failure
Info fld=0x0
end_request: I/O error, dev sda, sector 235752965
Buffer I/O error on device dm-6, logical block 2704279
lost page write due to I/O error on dm-6
Buffer I/O error on device dm-6, logical block 2704280
lost page write due to I/O error on dm-6
Buffer I/O error on device dm-6, logical block 2704281
lost page write due to I/O error on dm-6
Buffer I/O error on device dm-6, logical block 2704282
lost page write due to I/O error on dm-6
Buffer I/O error on device dm-6, logical block 2704283
lost page write due to I/O error on dm-6
Buffer I/O error on device dm-6, logical block 2704284
lost page write due to I/O error on dm-6
Buffer I/O error on device dm-6, logical block 2704285
lost page write due to I/O error on dm-6
Buffer I/O error on device dm-6, logical block 2704286
lost page write due to I/O error on dm-6
Aborting journal on device dm-6.
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
ext3_abort called.
EXT3-fs error (device dm-6): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
SCSI device sda: 859525120 512-byte hdwr sectors (440077 MB)
sda: Write Protect is off
sda: Mode Sense: 06 00 10 00
SCSI device sda: drive cache: write back w/ FUA
08-09-2010 06:34 AM
Re: EXT3-fs error (device dm-6)
How can I tell which disk has failed from the O/S side? I have no hardware monitoring tools.
08-09-2010 08:42 AM
Solution
> How can I tell which disk has failed from the O/S side? I have no hardware monitoring tools.
Your question would be much easier to answer if you had given more information about your set-up:
- name and version of Linux distribution
- system manufacturer and model
- RAID hardware model (if applicable)
First, let's try to find the persistent device-mapper device name that corresponds to /dev/dm-6:
ls -l /dev/dm-6 /dev/mapper/* /dev/md*
The device that has the same major and minor device numbers as /dev/dm-6 is the device you're looking for.
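For example, a quick way to match them up (a small sketch, assuming the usual device-mapper major number 253; the name dm-6 implies minor 6):
ls -l /dev/dm-6                       # note the "major, minor" pair, e.g. "253, 6"
ls -l /dev/mapper/ | grep '253, *6 '  # the mapper name with the same pair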
The next step would be to find out what /dev/dm-6 does and which hardware-level devices are associated with it. From the error messages, I assume /dev/sda is one of them; but are there others?
Possibly useful commands:
dmsetup table        # how each device-mapper device is laid out
dmsetup ls --tree    # dm devices and the block devices underneath them
cat /proc/mdstat     # Linux software RAID (md) status
pvs                  # LVM physical volumes and the space used on them
True hardware RAID usually hides the actual physical disks: the only way to get information about the state of the disks is to ask the driver. Usually some RAID-manufacturer-specific diagnostic program is required to get the full report, but basic information may be available in the /proc filesystem. Look into /proc/scsi/
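A generic starting point, regardless of the driver (these files exist on any 2.6 kernel with SCSI devices):
ls /proc/scsi/        # shows which low-level SCSI/RAID drivers are loaded
cat /proc/scsi/scsi   # lists every SCSI device the kernel knows about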
For example, if it's an HP SmartArray hardware RAID controlled by the "cciss" driver module, then "cat /proc/driver/cciss/cciss0" would display basic information about the first SmartArray controller on the system (controller 0).
If you had the "hpacucli" (HP Array Configuration Utility CLI) tool installed, the command "hpacucli controller all show config detail" would produce a more verbose report about the SmartArray controllers, including the state, model and serial numbers of all physical disks attached to them.
There's also the "Array Diagnostic Utility" which can produce an even more verbose report.
If you don't have any RAID diagnostic programs installed and cannot install them, the only way to identify a failed disk might be to look at the disk diagnostic LEDs on the server's front panel (if the RAID controller has such LEDs).
MK
08-10-2010 06:04 AM
Re: EXT3-fs error (device dm-6)
Wow, thank you so much for the response.
Basic environment info:
RHEL 5.1 with kernel 2.6.18-53.el5
Sun Fire X4150, and the RAID is done at the BIOS level.
Four 146 GB disks in RAID 5.
From the BIOS, I can see all four disk drives with a solid (good) status. And also in the /proc/scsi/scsi file, I can see all four disks, as listed below:
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: Sun Model: sys_root Rev: V1.0
Type: Direct-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 01 Id: 00 Lun: 00
Vendor: SEAGATE Model: ST914602SSUN146G Rev: 0603
Type: Direct-Access ANSI SCSI revision: 05
Host: scsi0 Channel: 01 Id: 01 Lun: 00
Vendor: SEAGATE Model: ST914602SSUN146G Rev: 0603
Type: Direct-Access ANSI SCSI revision: 05
Host: scsi0 Channel: 01 Id: 02 Lun: 00
Vendor: SEAGATE Model: ST914602SSUN146G Rev: 0603
Type: Direct-Access ANSI SCSI revision: 05
Host: scsi0 Channel: 01 Id: 03 Lun: 00
Vendor: SEAGATE Model: ST914602SSUN146G Rev: 0603
Type: Direct-Access ANSI SCSI revision: 05
And here are the results of the commands you asked me to try:
[root@mtstalpd-rac3 sg]# ls -l /dev/dm-6 /dev/mapper/* /dev/md*
ls: /dev/dm-6: No such file or directory
crw------- 1 root root 10, 63 Aug 9 09:01 /dev/mapper/control
brw-rw---- 1 root disk 253, 0 Aug 9 13:01 /dev/mapper/VolGroup00-LogVol00
brw-rw---- 1 root disk 253, 9 Aug 9 09:01 /dev/mapper/VolGroup00-LogVol01
brw-rw---- 1 root disk 253, 4 Aug 9 13:01 /dev/mapper/VolGroup00-LogVol02
brw-rw---- 1 root disk 253, 2 Aug 9 13:01 /dev/mapper/VolGroup00-LogVol03
brw-rw---- 1 root disk 253, 3 Aug 9 13:01 /dev/mapper/VolGroup00-LogVol04
brw-rw---- 1 root disk 253, 1 Aug 9 13:01 /dev/mapper/VolGroup00-LogVol05
brw-rw---- 1 root disk 253, 7 Aug 9 13:12 /dev/mapper/VolGroup00-oracleadminlv
brw-rw---- 1 root disk 253, 6 Aug 9 13:12 /dev/mapper/VolGroup00-oraclelv
brw-rw---- 1 root disk 253, 5 Aug 9 13:01 /dev/mapper/VolGroup00-standby
brw-rw---- 1 root disk 253, 8 Aug 9 13:01 /dev/mapper/VolGroup00-swaplv
brw-r----- 1 root disk 9, 0 Aug 9 13:01 /dev/md0
[root@mtstalpd-rac3 sg]# dmsetup table
VolGroup00-standby: 0 134217728 linear 8:2 79692160
VolGroup00-LogVol05: 0 8388608 linear 8:2 16777600
VolGroup00-oraclelv: 0 33554432 linear 8:2 213909888
VolGroup00-oraclelv: 33554432 4194304 linear 8:2 515899776
VolGroup00-LogVol04: 0 16777216 linear 8:2 33554816
VolGroup00-LogVol03: 0 8388608 linear 8:2 25166208
VolGroup00-LogVol02: 0 29360128 linear 8:2 50332032
VolGroup00-LogVol01: 0 134217728 linear 8:2 381682048
VolGroup00-oracleadminlv: 0 33554432 linear 8:2 247464320
VolGroup00-LogVol00: 0 16777216 linear 8:2 384
VolGroup00-swaplv: 0 100663296 linear 8:2 281018752
[root@mtstalpd-rac3 sg]# dmsetup ls --tree
VolGroup00-standby (253:5)
 └─ (8:2)
VolGroup00-LogVol05 (253:1)
 └─ (8:2)
VolGroup00-oraclelv (253:6)
 └─ (8:2)
VolGroup00-LogVol04 (253:3)
 └─ (8:2)
VolGroup00-LogVol03 (253:2)
 └─ (8:2)
VolGroup00-LogVol02 (253:4)
 └─ (8:2)
VolGroup00-LogVol01 (253:9)
 └─ (8:2)
VolGroup00-oracleadminlv (253:7)
 └─ (8:2)
VolGroup00-LogVol00 (253:0)
 └─ (8:2)
VolGroup00-swaplv (253:8)
 └─ (8:2)
[root@mtstalpd-rac3 sg]# cat /proc/mdstat
Personalities :
unused devices:
[root@mtstalpd-rac3 sg]# pvs
PV VG Fmt Attr PSize PFree
/dev/sda2 VolGroup00 lvm2 a- 409.72G 161.72G
[root@mtstalpd-rac3 sg]#
I didn't get any more information out of these. So is there a possibility that the disk controller is bad?
Thank you so much for your valuable time; I really appreciate it.
08-10-2010 10:28 AM
Re: EXT3-fs error (device dm-6)
The "dmesg" command lists the kernel message buffer. Old messages will only be removed from the buffer when overwritten by newer messages. The size of the message buffer used to be about 16 KB, but it may have been increased in newer kernels. When you run "dmesg", you get everything that's in the buffer - whether the messages are new or old.
If you want to clear the message buffer (to make it easier to see which messages are new), run "dmesg -c".
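A small sketch of how to use that for troubleshooting (note that "dmesg -c" needs root, and it clears the buffer for everyone on the system):
dmesg > /tmp/dmesg.before   # save a copy of the current buffer
dmesg -c > /dev/null        # clear the buffer
# reproduce the I/O activity, then:
dmesg                       # anything printed now was logged after the clear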
The last four messages seem to indicate a state change of some sort on /dev/sda. If that's the point where you hot-swapped the bad disk, it might have caused these messages.
If no new "Buffer I/O error" messages appear after the lines:
SCSI device sda: 859525120 512-byte hdwr sectors (440077 MB)
sda: Write Protect is off
sda: Mode Sense: 06 00 10 00
SCSI device sda: drive cache: write back w/ FUA
then your RAID5 set is probably OK now.
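One way to spot-check this is to read back one of the failing sectors reported earlier (sector 235753109 is taken from your dmesg output; this only reads, it does not write):
dd if=/dev/sda of=/dev/null bs=512 skip=235753109 count=1
If the RAID set has recovered, dd should complete without a new I/O error appearing in dmesg.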
In RHEL 5.1, the dm-* devices no longer exist in /dev, but the kernel error messages still refer to them. No matter:
>ext3_abort called.
>EXT3-fs error (device dm-6): ext3_journal_start_sb: Detected aborted journal
>Remounting filesystem read-only
These messages indicate that an error was detected at the filesystem level, and the filesystem was switched to read-only mode to protect the data. You can try to switch it back to read-write mode with:
mount -o remount,rw /dev/VolGroup00/oraclelv
but usually the system will refuse this command until the filesystem has been checked first.
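You can check the current mount state at any point; for example (oraclelv being the logical volume from your listing):
grep oraclelv /proc/mounts   # the options field starts with "ro" or "rw"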
To check the filesystem, you must stop the applications using it (i.e. Oracle) and unmount it:
umount /dev/VolGroup00/oraclelv
fsck -C 0 /dev/VolGroup00/oraclelv   # -C 0 displays a progress indicator
If the filesystem check finds no errors (or can fix all the errors it finds), you can mount the filesystem again and resume using it:
mount /dev/VolGroup00/oraclelv
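If you want to double-check afterwards, tune2fs can show whether ext3 still considers the filesystem clean (a sketch, assuming e2fsprogs is installed, as it is on any RHEL system):
tune2fs -l /dev/VolGroup00/oraclelv | grep -i state   # should say "clean"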
MK
08-10-2010 11:05 AM
Re: EXT3-fs error (device dm-6)
Thank you again for your kind and detailed response.
Yes, you are right, it was the /oracle filesystem that is having the issue. I ran fsck on /oracle twice yesterday, but whenever Oracle starts, the read-only mount comes back. So I am guessing the disk controller itself must be bad.
Again, after I saw your response, I did fsck:
umount /dev/VolGroup00/oraclelv
fsck -C 0 /dev/VolGroup00/oraclelv
(fixed one journal)
mount /dev/VolGroup00/oraclelv
Now I am able to touch files, but I haven't started Oracle yet, as the DBA is not here.
We are still getting I/O errors even after the following lines:
SCSI device sda: 859525120 512-byte hdwr sectors (440077 MB)
sda: Write Protect is off
sda: Mode Sense: 06 00 10 00
SCSI device sda: drive cache: write back w/ FUA
So do you think the RAID 5 was corrupted?
When I called Sun, they thought it was a bad disk. But like I said, I can see all four disks, so I am guessing it is something to do with the controller.
Again, thank you so much for teaching me and explaining. So nice of you.
01-25-2012 03:52 AM
Re: EXT3-fs error (device dm-6)
Hi there,
It should be much easier to identify the device causing the issue by checking the Array Diagnostic Utility (ADU) logs.
Example symptom:
lost page write due to I/O error on dm-6
Buffer I/O error on device dm-6, logical block 8201
lost page write due to I/O error on dm-6
REISERFS: abort (device dm-6): Journal write error in flush_commit_list
REISERFS: Aborting journal for filesystem on dm-6
Matching error in the ADU report:
Smart Array P400 in Embedded Slot : Storage Enclosure at Port 1I : Box 1 : Drive Cage on Port 1I : Physical Drive 1I:1:4 : Monitor and Performance Parameter Control
Bus Faults 8452 (0x2104)
Hot Plug Count 0x2104
Track Rewrite Errors 0x2902
Write Errors After Remap 0x2102
Background Firmware Revision 0x0848
Media Failures 0x2102
Hardware Errors 0x2102
Aborted Command Failures 0x2102
Spin Up Failures 0x2102
Bad Target Count 8450 (0x2102)
Predictive Failure Errors 0x2104
The hard drive on port 1I:1:4 is the root cause of the bus faults/timeouts. Usually a firmware upgrade of the hard drive itself solves the problem; if not, a classic hardware replacement will fix it for good.