
EMC failure

 
Larry Basford
Regular Advisor

EMC failure

One physical disk in a 7-disk VG went into the powerfailed state. The EMC is connected through 2 controllers, and the volume group is striped with the primary path alternating between the controllers. One disk of the mirrored pair failed and the volume group became inaccessible.
vgdisplay: Warning: couldn't query all of the physical volumes.
inq (EMC command) showed all the volumes
reboot of system took a long time.
We finally got it to return by doing a vgchange -a y /dev/vgname

Any comments welcome. EMC is still investigating.

Desaster recovery? Right !
Stefan Farrelly
Honored Contributor

Re: EMC failure


This is very interesting.

Normally if a physical disk fails, everything continues running off the other (mirror) disk, but you said 'the volume group was inaccessible'. What made you think it was inaccessible?

Then you rebooted, and of course the activation of the VG failed because a physical drive was unavailable. Normal procedure is to reboot only if you're going to replace the bad drive at the same time; that way the VG will activate and all you have to do is a vgcfgrestore and a vgsync.

So why did you reboot without replacing the bad drive? Were you hoping to fix the inaccessible VG?
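
For an LVM-mirrored disk, the vgcfgrestore and vgsync steps mentioned above might look roughly like the sketch below. The helper name resync_cmds is hypothetical, and it only prints the commands for review rather than running them. The VG and disk names are taken from later in this thread purely as an illustration.

```shell
# Print (do not run) the usual LVM steps after a failed disk has been replaced.
# resync_cmds is a hypothetical helper; check each command against the man pages first.
resync_cmds() {
    vg=$1      # volume group name, e.g. vg2hp9
    disk=$2    # replaced disk, e.g. c4t1d0
    echo "vgcfgrestore -n /dev/$vg /dev/rdsk/$disk"   # restore LVM config onto the new disk
    echo "vgchange -a y /dev/$vg"                     # (re)activate the volume group
    echo "vgsync /dev/$vg"                            # resynchronise stale mirror copies
}

resync_cmds vg2hp9 c4t1d0
```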
Im from Palmerston North, New Zealand, but somehow ended up in London...
Craig Rants
Honored Contributor

Re: EMC failure

That is usually the case with an EMC disk failure that I have seen, although EMC disks rarely fail in my experience. If you have a good backup of the information contained in the vg, a vgexport would also work to get rid of the LVM problem. You should then be able to recreate the vg and lvols. Let us know what they find out.

GL,
C
"In theory, there is no difference between theory and practice. But, in practice, there is. " Jan L.A. van de Snepscheut
Larry Basford
Regular Advisor

Re: EMC failure

My boss was troubleshooting this while I was not here.


vgdisplay: Warning: couldn't query physical volume "/dev/dsk/c4t1d0":
The specified path does not correspond to physical volume attached to
this volume group
vgdisplay: Warning: couldn't query physical volume "/dev/dsk/c6t1d0":
The specified path does not correspond to physical volume attached to
this volume group
vgdisplay: Warning: couldn't query all of the physical volumes.
PV Name /dev/dsk/c4t1d4
PV Name /dev/dsk/c6t1d4 Alternate Link
PV Status available
Total PE 1078
Free PE 1

Plus the syslog
Aug 30 06:58:22 ncshp9 vmunix: SCSI: Request Timeout; Abort Tag -- lbolt: 50979273, dev: 1f061000, io_id: 6307e4e
Aug 30 06:58:28 ncshp9 vmunix: LVM: vg[3]: pvnum=0 (dev_t=0x1f041000) is POWERFAILED
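
To tie the dev_t values in those syslog lines back to device files, here is a small sketch. It assumes the usual legacy sdisk layout (8-bit major, then 8-bit card instance, 4-bit target, 4-bit LUN, 8-bit options) and a shell that accepts 0x constants in arithmetic, such as bash. The function name devt_to_ctd is made up for this example.

```shell
# Decode a legacy sdisk dev_t (e.g. 0x1f041000) into a cXtYdZ name.
# Assumed layout: major(8 bits) | card instance(8) | target(4) | LUN(4) | options(8).
devt_to_ctd() {
    n=$(( $1 ))
    printf 'c%dt%dd%d\n' $(( (n >> 16) & 0xff )) $(( (n >> 12) & 0xf )) $(( (n >> 8) & 0xf ))
}

devt_to_ctd 0x1f041000   # dev_t from the POWERFAILED message
devt_to_ctd 0x1f061000   # dev from the SCSI timeout message (0x prefix added)
```

Under that assumption the two values decode to c4t1d0 and c6t1d0, the same pair of paths that vgdisplay could not query.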

Stefan Farrelly
Honored Contributor

Re: EMC failure


Those messages about a PV not being available and POWERFAILED are perfectly normal. They DON'T mean the VG is unavailable.

Normal procedure in this case is to leave the system alone. The VG still works via the mirror disk; only when you have a replacement drive ready to install do you shut the system down, replace the drive, reboot and resync.

Rebooting without replacing the bad drive is neither necessary nor advised (because of the problem you already saw: the VG cannot activate automatically while a PV is unavailable).
Larry Basford
Regular Advisor

Re: EMC failure

The file system on that volume group was not accessible.
/dev/dsk/c4t1d0 is the disk that failed in the EMC
/dev/dsk/c6t1d0 is the mirror and the alternate path
--- Physical volumes ---
PV Name /dev/dsk/c4t1d0
PV Name /dev/dsk/c6t1d0 Alternate Link

FROM YOUR NOTES
--- Logical volumes ---
LV Name /dev/vg2hp9/lvol11
LV Status available/syncd
LV Size (Mbytes) 30156
Current LE 7539
Allocated PE 7539
Used PV 6

-----
NOW
--- Logical volumes ---
LV Name /dev/vg2hp9/lvol11
LV Status available/syncd
LV Size (Mbytes) 30156
Current LE 7539
Allocated PE 7539
Used PV 7

The system does not need to be shut down to replace the mirrored disk in the EMC.
This is no funky software mirror of an lvol.

The system should not even blink at the loss of the disk (per EMC).
Ashwani Kashyap
Honored Contributor

Re: EMC failure

Looking at your syslog, it seems that the SCSI request timed out and also hung the SCSI/Fibre bus.

The problem is very unusual. But I got around the kind of error that was logged in your syslog by increasing the timeout period of all EMC disks from the default to 180 seconds.

Use pvchange to do that.

Hope it helps.
Larry Basford
Regular Advisor

Re: EMC failure

I have considered the pvchange -t 180 /dev/dsk/??????

It would need to be done to every disk in use on the EMC.
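
A minimal sketch of doing that over a list of disks. The helper name set_timeout_cmds is hypothetical and the two device paths are just examples from this thread. It only prints the pvchange commands so the list can be checked before anything is changed.

```shell
# Print (do not run) a "pvchange -t" command for each PV path given.
# set_timeout_cmds is a hypothetical helper; 180 is the value suggested above.
set_timeout_cmds() {
    timeout=$1
    shift
    for pv in "$@"; do
        echo "pvchange -t $timeout $pv"
    done
}

set_timeout_cmds 180 /dev/dsk/c4t1d0 /dev/dsk/c4t1d4
```

Once the printed list looks right, the output could be piped to sh to apply it.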

This has run for 2 years with no problem.

What about the pvchange autoswitch option?

It is not set.
James R. Ferguson
Acclaimed Contributor

Re: EMC failure

Hi Larry:

'autoswitch' is "on" by default. This means that LVM will switch back from an alternate pv_link to its primary once the primary has failed and recovered. If you have disabled it, I would re-enable it:

# pvchange -S y /dev/dsk/cXtYdZ

Regards!

...JRF...
Larry Basford
Regular Advisor

Re: EMC failure

I'm thinking it isn't switching.
This system
pvdisplay /dev/dsk/c4t1d0
--- Physical volumes ---
PV Name /dev/dsk/c4t1d0
PV Name /dev/dsk/c6t1d0 Alternate Link
VG Name /dev/vg2hp9
PV Status available
Allocatable yes
VGDA 2
Cur LV 1
PE Size (Mbytes) 4
Total PE 1078
Free PE 1
Allocated PE 1077
Stale PE 0
IO Timeout (Seconds) default
OTHER
--- Physical volumes ---
PV Name /dev/dsk/c1t6d0
VG Name /dev/vg00
PV Status available
Allocatable yes
VGDA 2
Cur LV 8
PE Size (Mbytes) 4
Total PE 1023
Free PE 0
Allocated PE 1023
Stale PE 0
IO Timeout (Seconds) 120
Autoswitch On
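
To compare these two settings across many disks, pvdisplay output like the above could be boiled down to one line per PV. This is only a sketch: pv_summary is a made-up name, and it assumes the field labels appear exactly as in the output shown here.

```shell
# Summarise pvdisplay output read from stdin: one line per PV with its
# IO timeout and autoswitch setting ("not shown" if the field is absent).
pv_summary() {
    awk '
        function emit() {
            if (pv != "") print pv, "timeout=" t, "autoswitch=" a
        }
        /^--- Physical volumes ---/ { emit(); pv = ""; t = "not shown"; a = "not shown"; next }
        /^PV Name/ && !/Alternate Link/ && pv == "" { pv = $3 }   # primary path only
        /^IO Timeout/ { t = $NF }
        /^Autoswitch/ { a = $NF }
        END { emit() }
    '
}

# Usage (on the HP-UX host): pvdisplay /dev/dsk/c4t1d0 | pv_summary
```

Fed the two listings above, it would report c4t1d0 with timeout=default and no autoswitch field shown, and c1t6d0 with timeout=120 and autoswitch=On.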