<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: LVM POWERFAIL message in Operating System - HP-UX</title>
    <link>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898966#M403732</link>
    <description>Hi Kevin,&lt;BR /&gt;&lt;BR /&gt;at first - the re-seating and power-cycling of the disk might have worked, but You can only be sure by going through all the log entries the disk produced.&lt;BR /&gt;&lt;BR /&gt;cstm should give You the disk's current status, and dd'ing will ensure every single block is still readable.&lt;BR /&gt;the 'lbolt' errors mean HP-UX was unable to read from or write to a specific disk block. &lt;BR /&gt;usually bad block relocation is active and works automatically, so this should have been handled properly by LVM.&lt;BR /&gt;You should still check that everything is really available.&lt;BR /&gt;&lt;BR /&gt;I'm not completely sure that Your data is completely mirrored - try the following:&lt;BR /&gt;&lt;BR /&gt;# check mirror states&lt;BR /&gt;for vg in `vgdisplay -v | grep "VG Name" | awk '{print $3}'`; &lt;BR /&gt; do&lt;BR /&gt;   lvdisplay $vg/lv* | egrep 'Name|Mirror'&lt;BR /&gt;done&lt;BR /&gt;&lt;BR /&gt;- this should return something like:&lt;BR /&gt;&lt;BR /&gt;LV Name                     /dev/vg00/lvol9&lt;BR /&gt;VG Name                     /dev/vg00&lt;BR /&gt;Mirror copies               1&lt;BR /&gt;&lt;BR /&gt;not everything HAS to be mirrored, but a mirrored LV can serve I/O from the surviving copy when one disk fails, so the chance of the system locking up on outstanding I/O is much lower :)&lt;BR /&gt;&lt;BR /&gt;for every vg in the system, do &lt;BR /&gt;# check that no STALE PEs exist&lt;BR /&gt;lvdisplay -v /dev/vgNN/lvol* | grep -ci stale&lt;BR /&gt;(should return 0)&lt;BR /&gt;&lt;BR /&gt;if an LE is marked as stale, it depends:&lt;BR /&gt;if only one PE is stale, go ahead, let HP check it over and then replace the disk; if both copies are stale, data might have been lost - check for missing data, locate the root cause, and recover if possible (this is rarely the case :)&lt;BR /&gt;&lt;BR /&gt;for the housekeeping part, a weekly defrag is unnecessary - while extent-based filesystems like vxfs are more prone to some fragmentation, in most cases once per quarter is enough. exceptions might be highly volatile filesystems like mail spools, etc.&lt;BR /&gt;&lt;BR /&gt;If You can get some downtime on one of the next weekends, think about the following:&lt;BR /&gt;&lt;BR /&gt;- Ignite backup of the whole vg00 to tape:&lt;BR /&gt;/opt/ignite/bin/make_tape_recovery -A -a /dev/rmt/0mn&lt;BR /&gt;- reboot the system to runlevel 1&lt;BR /&gt;- run fsck -F vxfs -o full,nolog on all filesystems&lt;BR /&gt;&lt;BR /&gt;after that, and with no additional disk issues, I'd feel confident that everything is still in good shape.&lt;BR /&gt;&lt;BR /&gt;recommended reading would be the STM and EMS manuals, and You could think about running a &lt;BR /&gt;tail -f /var/adm/syslog/syslog.log&lt;BR /&gt;via ssh from Your workstation, so that You're immediately notified when something goes wrong and get a feeling for normal and less normal system behaviour.</description>
    <pubDate>Fri, 29 Apr 2005 13:32:51 GMT</pubDate>
    <dc:creator>Florian Heigl (new acc)</dc:creator>
    <dc:date>2005-04-29T13:32:51Z</dc:date>
    <item>
      <title>LVM POWERFAIL message</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898963#M403729</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;I had a problem on our L1000 machine running 11.11, where an Oracle 9.2.0.4 database was having an "Update Statistics" job run while the application that uses that database was active. After several minutes it became clear that there was some kind of deadlock in the system, as simple commands like "bdf" and "ls -ltr" did not respond in the telnet sessions where they were invoked. Also, it became impossible to log in via telnet as anything but root.&lt;BR /&gt;&lt;BR /&gt;I tried (logged on as root) to kill (using kill -9 pid) all of the user processes that may have been interfering with the Oracle Update Stats job, and all but two could be killed. So then the only thing that I could think of doing was to reboot the machine, which I tried to do. When I initiated the restart (shutdown -r 0) everything seemed to go OK, until it got to the point where it needed to unmount the file systems. This "hung" for 1.5 hours while I was at lunch (this machine is not critical in our environment). I checked the console at this point and saw the following text: &lt;BR /&gt;&lt;BR /&gt;LVM: vg[1]: pvnum=0 (dev_t=0x1f022000) is POWERFAILED&lt;BR /&gt;DIAGNOSTIC SYSTEM WARNING: &lt;BR /&gt;   The diagnostic logging facility has started receiving excessive errors from the I/O subsystem. I/O error entries will be lost until the cause of the excessive I/O logging is corrected.&lt;BR /&gt;If the diaglogd is not active use the Daemon Startup command in stm to start it.&lt;BR /&gt;If the diaglogd daemon is active, use the logtool utility in stm to determine which I/O subsystem is logging exces&lt;BR /&gt;&lt;BR /&gt;at which point the text was cut off as if in mid-sentence.&lt;BR /&gt;&lt;BR /&gt;I have to admit to being a novice when it comes to HP-UX system administration, so I turned to this forum for answers.&lt;BR /&gt;&lt;BR /&gt;I found several hits on "LVM POWERFAILED" and they seemed to indicate that there may be an imminent h/w failure of the disk c2t2d0 (which I identified from "ll /dev/rdsk | grep 22000", found in one of those hits).&lt;BR /&gt;&lt;BR /&gt;Since then I have done the following:&lt;BR /&gt;1) powered off the hung system&lt;BR /&gt;2) powered on the system (which booted normally, but had some lbolt errors in the syslog)&lt;BR /&gt;3) shutdown -h -y 0&lt;BR /&gt;4) removed power from the system&lt;BR /&gt;5) opened the cabinet and re-seated the drives&lt;BR /&gt;6) powered on the system (this time no errors in syslog)&lt;BR /&gt;&lt;BR /&gt;So, after all that, now for my question...&lt;BR /&gt;&lt;BR /&gt;Is there any way to check the disk that was reported in error, e.g. a utility similar to SCANDISK in Windoze, or some other utility that will report on the health of the disk? I am sure one must exist and I am assuming that it is my novice status that prevents me from finding it. Any other advice on housekeeping or such would also be welcome. I already run a defrag of the disk on a weekly basis.&lt;BR /&gt;&lt;BR /&gt;Thanks in advance for detailed answers.&lt;BR /&gt;&lt;BR /&gt;Regards&lt;BR /&gt;Kevin</description>
      <pubDate>Fri, 29 Apr 2005 10:54:47 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898963#M403729</guid>
      <dc:creator>Kevin Bingham</dc:creator>
      <dc:date>2005-04-29T10:54:47Z</dc:date>
    </item>
    <item>
      <title>Re: LVM POWERFAIL message</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898964#M403730</link>
      <description>You can check with a dd:&lt;BR /&gt;&lt;BR /&gt;dd if=/dev/rdsk/c20t5d0 of=/dev/null bs=64k&lt;BR /&gt;&lt;BR /&gt;Change the c20t5d0 to your device...&lt;BR /&gt;&lt;BR /&gt;If there are any errors, the dd will abort...&lt;BR /&gt;&lt;BR /&gt;Rgds...Geoff</description>
      <pubDate>Fri, 29 Apr 2005 11:00:02 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898964#M403730</guid>
      <dc:creator>Geoff Wild</dc:creator>
      <dc:date>2005-04-29T11:00:02Z</dc:date>
    </item>
    <item>
      <title>Re: LVM POWERFAIL message</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898965#M403731</link>
      <description>Kevin,&lt;BR /&gt;&lt;BR /&gt;If you have support tools installed, you can check for errors logged to the disk with stm.&lt;BR /&gt;&lt;BR /&gt;cstm&lt;BR /&gt;cstm&amp;gt;map&lt;BR /&gt;cstm&amp;gt;select device &amp;lt;#&amp;gt; -- # of disk from list&lt;BR /&gt;cstm&amp;gt;info&lt;BR /&gt;cstm&amp;gt;infolog&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;Eric</description>
      <pubDate>Fri, 29 Apr 2005 11:07:37 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898965#M403731</guid>
      <dc:creator>erics_1</dc:creator>
      <dc:date>2005-04-29T11:07:37Z</dc:date>
    </item>
    <item>
      <title>Re: LVM POWERFAIL message</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898966#M403732</link>
      <description>Hi Kevin,&lt;BR /&gt;&lt;BR /&gt;at first - the re-seating and power-cycling of the disk might have worked, but You can only be sure by going through all the log entries the disk produced.&lt;BR /&gt;&lt;BR /&gt;cstm should give You the disk's current status, and dd'ing will ensure every single block is still readable.&lt;BR /&gt;the 'lbolt' errors mean HP-UX was unable to read from or write to a specific disk block. &lt;BR /&gt;usually bad block relocation is active and works automatically, so this should have been handled properly by LVM.&lt;BR /&gt;You should still check that everything is really available.&lt;BR /&gt;&lt;BR /&gt;I'm not completely sure that Your data is completely mirrored - try the following:&lt;BR /&gt;&lt;BR /&gt;# check mirror states&lt;BR /&gt;for vg in `vgdisplay -v | grep "VG Name" | awk '{print $3}'`; &lt;BR /&gt; do&lt;BR /&gt;   lvdisplay $vg/lv* | egrep 'Name|Mirror'&lt;BR /&gt;done&lt;BR /&gt;&lt;BR /&gt;- this should return something like:&lt;BR /&gt;&lt;BR /&gt;LV Name                     /dev/vg00/lvol9&lt;BR /&gt;VG Name                     /dev/vg00&lt;BR /&gt;Mirror copies               1&lt;BR /&gt;&lt;BR /&gt;not everything HAS to be mirrored, but a mirrored LV can serve I/O from the surviving copy when one disk fails, so the chance of the system locking up on outstanding I/O is much lower :)&lt;BR /&gt;&lt;BR /&gt;for every vg in the system, do &lt;BR /&gt;# check that no STALE PEs exist&lt;BR /&gt;lvdisplay -v /dev/vgNN/lvol* | grep -ci stale&lt;BR /&gt;(should return 0)&lt;BR /&gt;&lt;BR /&gt;if an LE is marked as stale, it depends:&lt;BR /&gt;if only one PE is stale, go ahead, let HP check it over and then replace the disk; if both copies are stale, data might have been lost - check for missing data, locate the root cause, and recover if possible (this is rarely the case :)&lt;BR /&gt;&lt;BR /&gt;for the housekeeping part, a weekly defrag is unnecessary - while extent-based filesystems like vxfs are more prone to some fragmentation, in most cases once per quarter is enough. exceptions might be highly volatile filesystems like mail spools, etc.&lt;BR /&gt;&lt;BR /&gt;If You can get some downtime on one of the next weekends, think about the following:&lt;BR /&gt;&lt;BR /&gt;- Ignite backup of the whole vg00 to tape:&lt;BR /&gt;/opt/ignite/bin/make_tape_recovery -A -a /dev/rmt/0mn&lt;BR /&gt;- reboot the system to runlevel 1&lt;BR /&gt;- run fsck -F vxfs -o full,nolog on all filesystems&lt;BR /&gt;&lt;BR /&gt;after that, and with no additional disk issues, I'd feel confident that everything is still in good shape.&lt;BR /&gt;&lt;BR /&gt;recommended reading would be the STM and EMS manuals, and You could think about running a &lt;BR /&gt;tail -f /var/adm/syslog/syslog.log&lt;BR /&gt;via ssh from Your workstation, so that You're immediately notified when something goes wrong and get a feeling for normal and less normal system behaviour.</description>
      <pubDate>Fri, 29 Apr 2005 13:32:51 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898966#M403732</guid>
      <dc:creator>Florian Heigl (new acc)</dc:creator>
      <dc:date>2005-04-29T13:32:51Z</dc:date>
    </item>
    <item>
      <title>Re: LVM POWERFAIL message</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898967#M403733</link>
      <description>Hi&lt;BR /&gt;&lt;BR /&gt;Do  lvdisplay -v /dev/vgxx/lvolxx  on all the LVs on the disk and look for any stale extents. Also, do&lt;BR /&gt;# pvdisplay -v /dev/dsk/cxtxdx&lt;BR /&gt;on your disk and check the PV status. If it is unavailable, you will have to replace the disk.&lt;BR /&gt;&lt;BR /&gt;James&lt;BR /&gt;</description>
      <pubDate>Fri, 29 Apr 2005 14:57:29 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898967#M403733</guid>
      <dc:creator>James George_1</dc:creator>
      <dc:date>2005-04-29T14:57:29Z</dc:date>
    </item>
    <item>
      <title>Re: LVM POWERFAIL message</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898968#M403734</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;This message normally comes when the power to the drive is cycled. This can have any number of causes - a loose connection in the power connectors, a power connector extender, etc. You said that even after the power cycle it was displaying lbolt errors, and that you then re-seated the disk after powering off the machine again. Did you notice anything loose at that point?&lt;BR /&gt;&lt;BR /&gt;It seems it was an intermittent disk problem / loose connection. Still, it is not advisable to use this device for important data until you are sure it will not create any problems in the future. Good advice here would be to mirror this disk onto a spare in your system and observe it for some time, so that your system is not driven to deadlock if the disk happens to experience problems again. Although dd will check every block of the disk now, it is not reliable if your drive has something other than media problems, especially intermittent ones.&lt;BR /&gt;&lt;BR /&gt;HTH,&lt;BR /&gt;Devender</description>
      <pubDate>Fri, 29 Apr 2005 23:20:12 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898968#M403734</guid>
      <dc:creator>Devender Khatana</dc:creator>
      <dc:date>2005-04-29T23:20:12Z</dc:date>
    </item>
    <item>
      <title>Re: LVM POWERFAIL message</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898969#M403735</link>
      <description>Hi&lt;BR /&gt;&lt;BR /&gt;Thanks to all those who have replied.&lt;BR /&gt;&lt;BR /&gt;More info about our machine: we do NOT use mirroring at all; the machine is used only for porting work.&lt;BR /&gt;&lt;BR /&gt;I have tried several of the (applicable) suggestions to find errors on the disk, but so far they come back clean. I am currently running "dd if=/dev/rdsk/c2t2d0 of=/dev/null bs=64k" and I guess that this will probably take a while to complete. I will let you know the results.&lt;BR /&gt;&lt;BR /&gt;Also, during the weekend we suffered a mains power failure and the UPS did not allow us to shut the machine down properly (it's quite low down in the pecking order as it is used for porting only), so the machine had a hard power-off restart. I checked the syslog and again there was an occurrence of the "lbolt" message, with details below:&lt;BR /&gt;&lt;BR /&gt;---begin snip---&lt;BR /&gt;May  2 20:25:19 hp2 vmunix: SCSI: First party detected bus hang -- lbolt: 3059, bus: 2&lt;BR /&gt;May  2 20:25:19 hp2 vmunix:             lbp-&amp;gt;state: 1060&lt;BR /&gt;May  2 20:25:19 hp2 vmunix:             lbp-&amp;gt;offset: 40&lt;BR /&gt;May  2 20:25:19 hp2 vmunix:             scb-&amp;gt;io_id: 2000003&lt;BR /&gt;May  2 20:25:19 hp2 vmunix:             scb-&amp;gt;cdb: 12 00 00 00 80 00&lt;BR /&gt;May  2 20:25:19 hp2 vmunix:             lbolt_at_timeout: 2959, lbolt_at_start: 2459&lt;BR /&gt;May  2 20:25:19 hp2 vmunix:             lsp-&amp;gt;state: 5&lt;BR /&gt;May  2 20:25:19 hp2 vmunix:     scratch_lsp: 0000000041218800&lt;BR /&gt;May  2 20:25:19 hp2 vmunix:     Pre-DSP script dump [fffffffff87ba020]:&lt;BR /&gt;May  2 20:25:19 hp2 vmunix:             00000000 00000000 41020000 f87ba290&lt;BR /&gt;May  2 20:25:19 hp2 vmunix:             78344000 0000000a 78351000 00000000&lt;BR /&gt;May  2 20:25:19 hp2 vmunix:     Script dump [fffffffff87ba040]:&lt;BR /&gt;May  2 20:25:19 hp2 vmunix:             0e000005 f87ba540 e0100004 f87ba7f8&lt;BR /&gt;May  2 20:25:19 hp2 vmunix:             870b0000 f87ba2d8 98080000 00000005&lt;BR /&gt;May  2 20:25:19 hp2 vmunix: SCSI: Resetting SCSI -- lbolt: 3659, bus: 2&lt;BR /&gt;May  2 20:25:19 hp2 vmunix: SCSI: Reset detected -- lbolt: 3659, bus: 2&lt;BR /&gt;&lt;BR /&gt;---end snip---&lt;BR /&gt;&lt;BR /&gt;I don't know if this sheds any more light on the nature of the problem...&lt;BR /&gt;&lt;BR /&gt;Thanks in advance&lt;BR /&gt;Kevin&lt;BR /&gt;&lt;BR /&gt;PS: I have STM, but where do I find the docs for STM and EMS?</description>
      <pubDate>Tue, 03 May 2005 04:51:29 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898969#M403735</guid>
      <dc:creator>Kevin Bingham</dc:creator>
      <dc:date>2005-05-03T04:51:29Z</dc:date>
    </item>
    <item>
      <title>Re: LVM POWERFAIL message</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898970#M403736</link>
      <description>Luckily for us the machine was still inside warranty, so I have a new disk and am busy transferring data... &lt;BR /&gt;&lt;BR /&gt;Thanks to all who replied.</description>
      <pubDate>Thu, 05 May 2005 06:39:04 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898970#M403736</guid>
      <dc:creator>Kevin Bingham</dc:creator>
      <dc:date>2005-05-05T06:39:04Z</dc:date>
    </item>
    <item>
      <title>Re: LVM POWERFAIL message</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898971#M403737</link>
      <description>as per previous reply</description>
      <pubDate>Thu, 05 May 2005 06:39:40 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/lvm-powerfail-message/m-p/4898971#M403737</guid>
      <dc:creator>Kevin Bingham</dc:creator>
      <dc:date>2005-05-05T06:39:40Z</dc:date>
    </item>
  </channel>
</rss>