R.O.
Esteemed Contributor

Failing disk in vg00, but stale ext in its mirror

Hi,

I have a failed disk in vg00 (mirrored):

(dmesg excerpt)
LVM: Failed to automatically resync PV 1f004000 error: 5
SCSI: First party detected bus hang -- lbolt: 75942186, bus: 0
lbp->state: 5020
lbp->offset: f0
lbp->uPhysScript: 280000
From most recent interrupt:
ISTAT: 21, SIST0: 00, SIST1: 00, DSTAT: 84, DSPS: 00000010
lsp: 0000000000000000
lbp->owner: 0000000043522d00
bp->b_dev: 1f004000
scb->io_id: 22623e
scb->cdb: 28 00 00 c9 82 c0 00 02 00 00
lbolt_at_timeout: 75941886, lbolt_at_start: 75941886
lsp->state: 10d
scratch_lsp: 0000000043522d00
Pre-DSP script dump [0000000044012030]:
78347400 0000000a 78350800 00000000
0e000004 00280540 80000000 00000000
Script dump [0000000044012050]:
870b0000 002802d8 98080000 00000005
721a0000 00000000 98080000 00000001
SCSI: Resetting SCSI -- lbolt: 75942286, bus: 0
SCSI: Reset detected -- lbolt: 75942286, bus: 0

From event.log:

Summary:
Disk at hardware path 10/0.4.0 : Media failure

...and a dd read of the disk hangs (I am not able to finish it or kill it):

HP:/#ps -ef|grep dd
7:19 dd if=/dev/rdsk/c0t4d0 of=/dev/null bs=1024
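
As an aside: if I decode the CDB in the SCSI dump correctly (opcode 28 = READ(10), LBA 0x00c982c0 = block 13206208, transfer length 0x0200 = 512 blocks), a bounded dd like the one below should probe just the suspect region instead of scanning the whole disk. The block numbers are only my reading of the dump, so treat this as a sketch:

HP:/#dd if=/dev/rdsk/c0t4d0 of=/dev/null bs=512 skip=13206208 count=512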

diskinfo still works on the failing disk:

HP:/#diskinfo -v /dev/rdsk/c0t4d0
SCSI describe of /dev/rdsk/c0t4d0:
vendor: SEAGATE
product id: ST19171W
type: direct access
size: 8886762 Kbytes
bytes per sector: 512
rev level: HP06
blocks per disk: 17773524
ISO version: 0
ECMA version: 0
ANSI version: 2
removable media: no
response format: 2
(Additional inquiry bytes: (32)41 etc


BUT, the stale extents are on the other disk of the mirror:

HP:/root#lvdisplay -v /dev/vg00/lvol8
--- Logical volumes ---
LV Name /dev/vg00/lvol8
VG Name /dev/vg00
LV Permission read/write
LV Status available/stale
Mirror copies 1
Consistency Recovery MWC
Schedule parallel
LV Size (Mbytes) 1376
Current LE 344
Allocated PE 688
Stripes 0
Stripe Size (Kbytes) 0
Bad block on
Allocation strict
IO Timeout (Seconds) default

--- Distribution of logical volume ---
PV Name LE on PV PE on PV
/dev/dsk/c0t5d0 344 344
/dev/dsk/c0t4d0 344 344

--- Logical extents ---
LE PV1 PE1 Status 1 PV2 PE2 Status 2
00000 /dev/dsk/c0t5d0 01824 current /dev/dsk/c0t4d0 01511 current
00001 /dev/dsk/c0t5d0 01825 current /dev/dsk/c0t4d0 01512 current
00002 /dev/dsk/c0t5d0 01826 current /dev/dsk/c0t4d0 01513 current
00003 /dev/dsk/c0t5d0 01827 current /dev/dsk/c0t4d0 01514 current
00004 /dev/dsk/c0t5d0 01828 current /dev/dsk/c0t4d0 01515 current
00005 /dev/dsk/c0t5d0 01829 current /dev/dsk/c0t4d0 01516 current

......

00100 /dev/dsk/c0t5d0 01924 stale /dev/dsk/c0t4d0 01611 current
00101 /dev/dsk/c0t5d0 01925 stale /dev/dsk/c0t4d0 01612 current
00102 /dev/dsk/c0t5d0 01926 stale /dev/dsk/c0t4d0 01613 current
00103 /dev/dsk/c0t5d0 01927 stale /dev/dsk/c0t4d0 01614 current
00104 /dev/dsk/c0t5d0 01928 stale /dev/dsk/c0t4d0 01615 current
00105 /dev/dsk/c0t5d0 01929 stale /dev/dsk/c0t4d0 01616 current
.....etc

If I reduce the failing disk, the stale extents will remain on the "good" disk; if I reduce the "good" disk, I am left with the "bad" disk. How can I solve this problem?
(It is a test system and I have an Ignite tape, so it is not critical)
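
In principle, since the current copies of those extents sit on c0t4d0, a resync should clear the stale extents on c0t5d0 as long as c0t4d0 can still be read. So something like the following might be worth a try first, although I suspect it will hang on the bad blocks just like the dd did:

HP:/#lvsync /dev/vg00/lvol8   # resync the stale extents of this one LV
HP:/#vgsync vg00              # or resync every LV in the volume group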

Regards,



"When you look into an abyss, the abyss also looks into you"
Elmar P. Kolkman
Honored Contributor

Re: Failing disk in vg00, but stale ext in its mirror

Your problem with the stale extents is irritating, but I've seen and solved this before.
A bigger issue is: is c0t5d0 bootable?
And are your first lvols (stand, root and swap) at the same physical extent numbers on both disks? Because the output you show has a difference in PE numbers!
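
You can check both from the running system; a quick sketch (device paths assumed from your post):

HP:/#lifls -l /dev/rdsk/c0t5d0    # the boot LIF area should list ISL, AUTO, HPUX
HP:/#lvlnboot -v /dev/vg00        # shows which PVs hold the boot, root, swap and dump volumes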

I would suggest: reboot the server from the alternate disk (c0t5d0), but only after making sure your backup is up to date... Then start removing c0t4d0 from vg00 and replace it with a working new disk.
And then mirror in numerical order, not alphabetical order (lvol2 needs to be mirrored before lvol11!).
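
Roughly the classic sequence, from memory (a sketch only; verify the device paths against your own ioscan output before running anything):

HP:/#lvreduce -m 0 /dev/vg00/lvol1 /dev/dsk/c0t4d0    # drop the bad mirror copy; repeat for every lvol
HP:/#vgreduce vg00 /dev/dsk/c0t4d0                    # remove the bad PV from the VG, then swap the disk
HP:/#pvcreate -B /dev/rdsk/c0t4d0                     # -B reserves the boot area on the new disk
HP:/#vgextend vg00 /dev/dsk/c0t4d0
HP:/#mkboot /dev/rdsk/c0t4d0                          # install the boot programs
HP:/#mkboot -a "hpux -lq" /dev/rdsk/c0t4d0            # AUTO file: allow booting without quorum
HP:/#lvextend -m 1 /dev/vg00/lvol1 /dev/dsk/c0t4d0    # re-mirror lvol1, then lvol2, lvol3, ... in numerical order
HP:/#lvlnboot -R /dev/vg00                            # refresh the boot information in the BDRA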
Every problem has at least one solution. Only some solutions are harder to find.
vdf
Advisor

Re: Failing disk in vg00, but stale ext in its mirror

The difference in PE numbers is due to the fact that I booted from the alternate disk after lvsplitting the lvols, and then did the lvmerge using the "b" lvols as the source lvols. I split lvol3 last, hence the difference.

I have shut down the system and removed the c0t4d0 disk, and now it drops me at the bcheckrc prompt, because lvol8 (/var) has I/O errors in its metadata and "fsck -o full" does not work.
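
For the record, the variants I mean, run against the raw device (assuming /var is VxFS, since -o full is a VxFS fsck option; the nolog variant is for the case where the intent log itself is damaged):

HP:/#fsck -F vxfs -o full /dev/vg00/rlvol8
HP:/#fsck -F vxfs -o full,nolog /dev/vg00/rlvol8    # skip intent-log replay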

It seems that both copies of the "stale" extents were actually stale, and now it is not possible to fix the file system. Am I right?

Before rebooting the system, I tried lvreduce on the mirror and lvsplit, without success. Was there a way to solve the problem before rebooting?

I think now the only way is to restore from Ignite, or to newfs /var and find a way to restore its contents from the Ignite tape.
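
Roughly what I have in mind (only a sketch: the number of tape marks to skip depends on how the Ignite tape was written, and I am assuming a tar-format archive and the usual /dev/rmt/0mn no-rewind device):

HP:/#newfs -F vxfs /dev/vg00/rlvol8
HP:/#mount /dev/vg00/lvol8 /var
HP:/#mt -f /dev/rmt/0mn rew
HP:/#mt -f /dev/rmt/0mn fsf 1           # skip past the boot LIF to reach the archive
HP:/#cd / && tar xvf /dev/rmt/0mn var   # pull only /var back out of the archive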

Regards,
Prashanth.D.S
Honored Contributor

Re: Failing disk in vg00, but stale ext in its mirror

Hi,

In my opinion the fault was actually on c0t5d0, and c0t4d0 was in good condition.

c0t4d0 couldn't sync up with c0t5d0, hence the errors you noticed in syslog.

The proper procedure, in my opinion, would have been:

Reboot the box and try booting from the primary boot disk (without quorum) and check if it works; if not, boot from the alternate, again without quorum. This way you could have found out which disk was the culprit. Reducing the LV was not an option here.
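
On PA-RISC that looks something like this from the console (a sketch; the exact menu wording varies by model):

Main Menu: Enter command or menu > bo pri
Interact with IPL (Y or N)?> y
ISL> hpux -lq            # -lq boots without the quorum check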

You could have tried a dd on c0t5d0 as well. You can also check for read and write errors on these disks from cstm.
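
For example (the cstm one-liner is from memory, so double-check the tool names it reports):

HP:/#dd if=/dev/rdsk/c0t5d0 of=/dev/null bs=1024k    # sequential read test of the other disk
HP:/#echo "scl type disk;info;wait;infolog" | cstm   # dump the info/error logs for all disks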

Best Regards,
Prashanth