Operating System - HP-UX


 
Ralph Grothe
Honored Contributor

catch22; trapped by failed disk

Hello,

a disk appears to be defective.
The usual test read of a few blocks from the raw device, e.g.
dd if=/dev/rdsk/c2t9d0 of=/dev/null bs=1024k count=10
hangs so hard that it even ignores SIGKILL.
In addition, the kernel SCSI driver floods syslog with messages like this:

Feb 26 12:04:49 ganymed vmunix: SCSI: Unexpected Disconnect -- lbolt: 102770286,
dev: bc029000, io_id: 201dab0
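
Since the test read can block indefinitely, one option is to run it in the background and poll it against a deadline, so that at least the calling shell stays usable. The following is only a sketch in POSIX sh: the function name probe_with_deadline and the return code 124 are made up for illustration, and it does not attempt to kill the stuck process, because a dd blocked in the driver may not react to signals anyway.

```shell
#!/bin/sh
# Hypothetical helper: run a command in the background and report
# whether it finished within a deadline given in seconds.
# Returns the command's own exit status if it finished in time,
# or 124 (an arbitrary choice) if it was still running.
probe_with_deadline() {
    pwd_deadline=$1
    shift
    "$@" >/dev/null 2>&1 &
    pwd_pid=$!
    pwd_elapsed=0
    while [ "$pwd_elapsed" -lt "$pwd_deadline" ]; do
        # kill -0 sends no signal; it only tests whether the PID exists.
        if ! kill -0 "$pwd_pid" 2>/dev/null; then
            wait "$pwd_pid"
            return $?
        fi
        sleep 1
        pwd_elapsed=$((pwd_elapsed + 1))
    done
    if kill -0 "$pwd_pid" 2>/dev/null; then
        echo "still running after ${pwd_deadline}s (pid $pwd_pid)" >&2
        return 124
    fi
    wait "$pwd_pid"
}
```

With the device from this post, a call might look like: probe_with_deadline 30 dd if=/dev/rdsk/c2t9d0 of=/dev/null bs=1024k count=10. A return code of 124 would then mean the read was still hanging after 30 seconds.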

Unfortunately, the affected disk is the first mirror disk of the VG that acts as the cluster lock disk of a two-node SG cluster.

This results in the nice feature that every attempt to remove the disk from the current LVM configuration (e.g. lvreduce with or without the -k option, vgremove, etc.) hangs in the same way as the dd.
Of course, the same is true for attempts to reconfigure the cluster binary, since every cmapplyconf hangs too.

Even when I force a SCSI bus reset by pulling (and replugging) the hot-swap disk, every system command that communicates with the device still hangs.

Looks to me like a chicken-and-egg paradox.

Does anyone have a spell to break this infinite loop?

Regards
Ralph
Madness, thy name is system administration
harry d brown jr
Honored Contributor

Re: catch22; trapped by failed disk

Ralph,

Have you tried replacing the bad disk with a good one?

live free or die
harry
Live Free or Die
A. Clay Stephenson
Acclaimed Contributor

Re: catch22; trapped by failed disk

Hi Ralph:

I would pull the bad disk, replace it with a good one, and start the normal procedure:
1) vgcfgrestore 2) vgchange -a y 3) vgsync

This really should be no different from a completely failed disk. I never bother with the lvreduce/vgreduce operation.
If it ain't broke, I can fix that.
Sanjay_6
Honored Contributor

Re: catch22; trapped by failed disk

Krishna Prasad
Trusted Contributor

Re: catch22; trapped by failed disk

I am not sure whether the fact that it is a lock disk will force you to reboot. However, I do know that under normal circumstances you do not need to reboot.

replace bad disk
vgcfgrestore /dev/vg00
vgsync
Positive Results requires Positive Thinking
Krishna Prasad
Trusted Contributor

Re: catch22; trapped by failed disk

Add vgchange -a y /dev/vg00 in the middle of the steps in my post.
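
Taken together, the two posts above describe a three-step sequence after the physical swap. A minimal sketch follows, assuming the replacement disk shows up at the same device path as the one in the original post (/dev/rdsk/c2t9d0) and that the volume group is /dev/vg00 as in the example; neither is confirmed for the system in question. The run wrapper, the function name and the DRY_RUN variable are made up for illustration, and the script only prints the commands unless DRY_RUN=0 is set. Anything the cluster lock itself might need afterwards is not covered.

```shell
#!/bin/sh
# Assumed values; adjust to the real volume group and raw device.
VG=${VG:-/dev/vg00}
PV_RAW=${PV_RAW:-/dev/rdsk/c2t9d0}
DRY_RUN=${DRY_RUN:-1}

# Hypothetical wrapper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

replace_mirror_disk() {
    # 1) Write the saved LVM configuration back to the new disk.
    run vgcfgrestore -n "$VG" "$PV_RAW" &&
    # 2) Reactivate the volume group so LVM attaches the new disk.
    run vgchange -a y "$VG" &&
    # 3) Resynchronize the stale mirror copies.
    run vgsync "$VG"
}

replace_mirror_disk
```

Run as is, it prints the three commands in order; each step runs only if the previous one succeeded.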
Positive Results requires Positive Thinking
Ralph Grothe
Honored Contributor

Re: catch22; trapped by failed disk

Thanks to everyone for their suggestions.

Unfortunately, replacing the disk with a new one isn't really an option in this case.

The two nodes are old D-class boxes that were slated to be scrapped.

Then our planners and decision makers decided to introduce Tivoli, so they needed a test platform.

That was how I could "save" the two boxes, since we now had a reasonable use for them.

But of course I won't get new hardware such as replacement hard disks.

Nevertheless, yesterday I somehow managed to remove the defective disk from the LVM and cluster configuration by pulling and replugging the hot-swappable disk, thus triggering a SCSI bus reset.
I tried this a couple of times, each time issuing a diskinfo command afterwards.
When the disk finally reported its characteristics, I was able to issue the lvreduce command, which was answered with a success message :-) and then a system panic :-(
However, once the machine was up again I could confirm that the lvreduce must have succeeded, since the defective disk no longer appeared in the lvdisplay output.
After that the rest was easy, and my cluster now runs with a different lock disk.
Madness, thy name is system administration