<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: make_net_recovery dangling in Operating System - HP-UX</title>
    <link>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961623#M507844</link>
    <description>Glad you found it.&lt;BR /&gt;&lt;BR /&gt;Btw, as of 11.11 TCOE the scsictl command has an option to trigger the dreaded 'domain validation test'; maybe think about running it twice daily, as it should detect such errors. On the other hand, you could simply monitor the EMS event_log :)&lt;BR /&gt;&lt;BR /&gt;Have a nice evening.&lt;BR /&gt;</description>
    <pubDate>Wed, 14 Mar 2007 12:22:00 GMT</pubDate>
    <dc:creator>Florian Heigl (new acc)</dc:creator>
    <dc:date>2007-03-14T12:22:00Z</dc:date>
    <item>
      <title>make_net_recovery dangling</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961619#M507840</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;I urgently need to patch an HP test box (given only a limited time window for installing GOLDQPK11i of the Dec 2006 Support+).&lt;BR /&gt;Because I haven't applied this patch before (the latest was June 2006), I wanted to make sure I had a recent Ignite image in advance.&lt;BR /&gt;I had meant to have done this by yesterday, but a make_tape_recovery lingered for over 3 hours, after which I deemed the attempt futile and killed the make_* procs gracefully, allowing them to remove their locks.&lt;BR /&gt;For this box I had to revert to make_tape_recovery because of incompatible releases of Ignite on the client (C.6.8.152) and the server (C.6.7.79); the recovery.log gave me the dissatisfying advice to either upgrade the server or downgrade the client.&lt;BR /&gt;Upgrading the server wasn't an option, as it is one of three production servers igniting one another in round-robin fashion.&lt;BR /&gt;This morning I had to dig through stockpiles of HP Application CDs to at last arrive at one from March 2006 that luckily bore the very release C.6.7.79 I required to downgrade this client to align it with the server.&lt;BR /&gt;I must say, this is where Ignite really sucks.&lt;BR /&gt;On the one hand it requires you to have exactly the same release on clients and server, while on the other hand I couldn't find a link anywhere on the whole software.hp.com site that would lead me to a download of my required, though obsolete, Ignite release.&lt;BR /&gt;With that pain behind me I started make_net_recovery as a batch job and am now experiencing the same idling.&lt;BR /&gt;&lt;BR /&gt;This is how the Ignite procs currently appear in the client's process table.&lt;BR /&gt;&lt;BR /&gt;# UNIX95= ps -x -o pid,ppid,stime,state,cpu,args -p 13896,13766&lt;BR /&gt;  PID  PPID    STIME S  C COMMAND&lt;BR /&gt;13896 13766 11:19:52 S  0 /usr/bin/sh /opt/ignite/data/scripts/make_sys_image -s local -L&lt;BR 
/&gt;13766 13765 11:19:49 S  0 /opt/ignite/bin/make_net_recovery -v -P s -s igux-server -x inc_entire=vg00 -x exclude=/tmp -x exclude=/var/adm/crash -x exclude=/var/tmp -x exclude=/var/spool/sw&lt;BR /&gt;&lt;BR /&gt;When I attach to both PIDs with tusc I can see them sleeping on a read().&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;# tusc -apf 13896&lt;BR /&gt;( Attached to process 13896 ("/usr/bin/sh /opt/ignite/data/scripts/make_sys_image -s local -L") [32-bit] )&lt;BR /&gt;[13896] read(3, 0x77ff46b8, 1024) .......................................... [sleeping]&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;In the latest/recovery.log on the server these are the last written lines.&lt;BR /&gt;&lt;BR /&gt;# tail /var/opt/ignite/clients/igux-client/recovery/latest/recovery.log &lt;BR /&gt;                                                /dev/vg01/lv_data02     /data02 0&lt;BR /&gt;&lt;BR /&gt;        ** 0 - The Volume Group or Filesystem is Not included in the&lt;BR /&gt;               System Recovery Archive&lt;BR /&gt;        ** 1 - The Volume Group or Filesystem is Partially included in the&lt;BR /&gt;               System Recovery Archive&lt;BR /&gt;        ** 2 - The Volume Group or Filesystem is Fully included in the&lt;BR /&gt;               System Recovery Archive&lt;BR /&gt;&lt;BR /&gt;       * Checking Versions of Recovery Tools&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;I can see that the NFS shares are mounted on the client.&lt;BR /&gt;&lt;BR /&gt;An lsof on the client's PIDs shows these open files.&lt;BR /&gt;&lt;BR /&gt;# lsof -nP -p 13896,13766      &lt;BR /&gt;COMMAND     PID USER   FD   TYPE     DEVICE SIZE/OFF  NODE NAME&lt;BR /&gt;make_net_ 13766 root  cwd    DIR     64,0x3     8192  5605 /root/scripts&lt;BR /&gt;make_net_ 13766 root  txt    REG     64,0x6   126976 70370 /opt (/dev/vg00/lvol6)&lt;BR /&gt;make_net_ 13766 root  mem    REG     64,0x7    36864 13819 /usr/lib/libnss_dns.1&lt;BR /&gt;make_net_ 13766 root  mem    REG     64,0x7    53248 14533 
/usr/lib/libnss_files.1&lt;BR /&gt;make_net_ 13766 root  mem    REG     64,0x7    12794  6309 /usr/lib/tztab&lt;BR /&gt;make_net_ 13766 root  mem    REG     64,0x6   544768 70336 /opt (/dev/vg00/lvol6)&lt;BR /&gt;make_net_ 13766 root  mem    REG     64,0x7   221184  5999 /usr/lib/libCsup_v2.2&lt;BR /&gt;make_net_ 13766 root  mem    REG     64,0x7   282624  5839 /usr/lib/libm.2&lt;BR /&gt;make_net_ 13766 root  mem    REG     64,0x7    12288  6028 /usr/lib/libisamstub.1&lt;BR /&gt;make_net_ 13766 root  mem    REG     64,0x7  1261568 13838 /usr/lib/libcl.2&lt;BR /&gt;make_net_ 13766 root  mem    REG     64,0x7  1822720  6084 /usr/lib/libc.2&lt;BR /&gt;make_net_ 13766 root  mem    REG     64,0x7    24576 30789 /usr/lib/libdld.2&lt;BR /&gt;make_net_ 13766 root  mem    REG     64,0x7   274432 30787 /usr/lib/dld.sl&lt;BR /&gt;make_net_ 13766 root    0u   REG     64,0x8     2411  4827 /var (/dev/vg00/lvol8)&lt;BR /&gt;make_net_ 13766 root    1w   REG     64,0x8      292 28518 /var (/dev/vg00/lvol8)&lt;BR /&gt;make_net_ 13766 root    2w   REG     64,0x8      292 28518 /var (/dev/vg00/lvol8)&lt;BR /&gt;make_net_ 13766 root    3wW  REG     64,0x8        0 28520 /var (/dev/vg00/lvol8)&lt;BR /&gt;make_net_ 13766 root    4u   REG     78,0x9     1609  7230 /var/opt/ignite/recovery/client_mnt/0x00306E08FFFF/recovery/2007-03-14,11:19/recovery.log&lt;BR /&gt;make_net_ 13766 root    5r  FIFO 0x491b1c48      0t0 88059 &lt;BR /&gt;make_sys_ 13896 root  cwd    DIR     64,0x3     8192  5605 /root/scripts&lt;BR /&gt;make_sys_ 13896 root  txt    REG     64,0x7   204800 30110 /usr/bin/rsh&lt;BR /&gt;make_sys_ 13896 root  mem    REG     64,0x7    24576 30789 /usr/lib/libdld.2&lt;BR /&gt;make_sys_ 13896 root  mem    REG     64,0x7  1822720  6084 /usr/lib/libc.2&lt;BR /&gt;make_sys_ 13896 root  mem    REG     64,0x7   274432 30787 /usr/lib/dld.sl&lt;BR /&gt;make_sys_ 13896 root    0u   REG     64,0x8     2411  4827 /var (/dev/vg00/lvol8)&lt;BR /&gt;make_sys_ 13896 root    1w  FIFO 0x491b1c48   
   0t0 88059 &lt;BR /&gt;make_sys_ 13896 root    2w  FIFO 0x491b1c48      0t0 88059 &lt;BR /&gt;make_sys_ 13896 root    3r  FIFO 0x48d237c8      0t0 88064 &lt;BR /&gt;make_sys_ 13896 root   28r   REG     64,0x6   103506 70301 /opt (/dev/vg00/lvol6)&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Does anyone see what is possibly causing the deadlock?&lt;BR /&gt;Was my batch job approach (which always has worked so far) wrong?&lt;BR /&gt;&lt;BR /&gt;Regards&lt;BR /&gt;Ralph</description>
      <pubDate>Wed, 14 Mar 2007 08:13:26 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961619#M507840</guid>
      <dc:creator>Ralph Grothe</dc:creator>
      <dc:date>2007-03-14T08:13:26Z</dc:date>
    </item>
    <item>
      <title>Re: make_net_recovery dangling</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961620#M507841</link>
      <description>I observed a hang of an "lvlnboot -v vg00" that I executed in a tsm window.&lt;BR /&gt;Looks like I've got a disk, controller, or LVM problem on my Ignite client...&lt;BR /&gt;</description>
      <pubDate>Wed, 14 Mar 2007 09:25:17 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961620#M507841</guid>
      <dc:creator>Ralph Grothe</dc:creator>
      <dc:date>2007-03-14T09:25:17Z</dc:date>
    </item>
    <item>
      <title>Re: make_net_recovery dangling</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961621#M507842</link>
      <description>Ralph, &lt;BR /&gt;&lt;BR /&gt;I'm not sure this applies to your issue, but most of the time we track this down to stale (and hard) NFS mounts on the system.&lt;BR /&gt;&lt;BR /&gt;Is there anything to be gleaned from dmesg?&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;florian</description>
      <pubDate>Wed, 14 Mar 2007 10:44:58 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961621#M507842</guid>
      <dc:creator>Florian Heigl (new acc)</dc:creator>
      <dc:date>2007-03-14T10:44:58Z</dc:date>
    </item>
    <item>
      <title>Re: make_net_recovery dangling</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961622#M507843</link>
      <description>Ouch, I don't know why I hadn't focused earlier on the vmunix messages in syslog.&lt;BR /&gt;A couple of hours ago there was a SCSI bus hang, according to some entries in syslog.log on the Ignite client (see below).&lt;BR /&gt;(but after that nothing else from vmunix was logged)&lt;BR /&gt;&lt;BR /&gt;Though the lvlnboot returned after a while, it looks like we've got a disk problem here.&lt;BR /&gt;&lt;BR /&gt;# lvlnboot -v vg00&lt;BR /&gt;lvlnboot: LIF information corrupt or not present on  "/dev/dsk/c2t2d0".&lt;BR /&gt;Use the "mkboot" command to initialize the LIF area.&lt;BR /&gt;Boot Definitions for Volume Group /dev/vg00:&lt;BR /&gt;Physical Volumes belonging in Root Volume Group:&lt;BR /&gt;        /dev/dsk/c2t2d0 (0/0/2/0.2.0)&lt;BR /&gt;        /dev/dsk/c1t2d0 (0/0/1/1.2.0) -- Boot Disk&lt;BR /&gt;Boot: lvol1     on:     /dev/dsk/c2t2d0&lt;BR /&gt;                        /dev/dsk/c1t2d0&lt;BR /&gt;Root: lvol3     on:     /dev/dsk/c2t2d0&lt;BR /&gt;                        /dev/dsk/c1t2d0&lt;BR /&gt;Swap: lvol2     on:     /dev/dsk/c2t2d0&lt;BR /&gt;                        /dev/dsk/c1t2d0&lt;BR /&gt;Dump: lvol2     on:     /dev/dsk/c1t2d0, 0&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;There are quite a few PEs stale by now.&lt;BR /&gt;&lt;BR /&gt;# pvdisplay -v $(vgdisplay -v vg00|awk '/PV Name/{print$NF}')|grep -c stale&lt;BR /&gt;1077&lt;BR /&gt;&lt;BR /&gt;Let's see if I can read from /dev/rdsk/c2t2d0&lt;BR /&gt;&lt;BR /&gt;# timex dd if=/dev/rdsk/c2t2d0 of=/dev/null bs=1024k count=100&lt;BR /&gt;&lt;BR /&gt;No, it seems to hang.&lt;BR /&gt;This is pretty daft because only a few weeks ago one of the root disks of this box was replaced.&lt;BR /&gt;At least the 3-year warranty hasn't expired yet.&lt;BR /&gt;&lt;BR /&gt;I am sorry for having bothered you.&lt;BR /&gt;These hangs must always have a natural cause...&lt;BR /&gt;&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix: 
SCSI: First party detected bus hang -- lbolt: 137026454, bus: 2&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:   lbp-&amp;gt;state: 5060&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:   lbp-&amp;gt;offset: f0&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:   lbp-&amp;gt;uPhysScript: 81fba000&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:  From most recent interrupt:&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:   ISTAT: 01, SIST0: 00, SIST1: 00, DSTAT: 84, DSPS: 00000010&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:  lsp: 0000000000000000&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:  lbp-&amp;gt;owner: 00000000491a2e00&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:   bp-&amp;gt;b_dev: bc022000&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:   scb-&amp;gt;io_id: 21f5360&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:   scb-&amp;gt;cdb: 28 00 00 00 00 01 00 00 10 00&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:   lbolt_at_timeout: 137023254, lbolt_at_start: 137023254&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:   lsp-&amp;gt;state: 10d&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:  scratch_lsp: 00000000491a2e00&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:  Pre-DSP script dump [ffffffff81fba030]:&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:   78347500 0000000a 78350800 00000000&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:   0e000004 81fba540 e0100004 81fba7d4&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:  Script dump [ffffffff81fba050]:&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:   870b0000 81fba2d8 98080000 00000005&lt;BR /&gt;Mar 14 11:24:38 igux-client vmunix:   721a0000 00000000 98080000 00000001&lt;BR /&gt;Mar 14 11:24:39 igux-client vmunix: SCSI: Resetting SCSI -- lbolt: 137026554, bus: 2&lt;BR /&gt;Mar 14 11:24:39 igux-client vmunix: SCSI: Reset detected -- lbolt: 137026554, bus: 2&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix: SCSI: First party detected bus hang -- lbolt: 137030354, bus: 2&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:   
lbp-&amp;gt;state: 1060&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:   lbp-&amp;gt;offset: f0&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:   lbp-&amp;gt;uPhysScript: 81fba000&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:  From most recent interrupt:&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:   ISTAT: 01, SIST0: 00, SIST1: 00, DSTAT: 84, DSPS: 00000010&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:  lsp: 0000000000000000&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:  lbp-&amp;gt;owner: 000000004212ef00&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:   bp-&amp;gt;b_dev: bc022000&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:   scb-&amp;gt;io_id: 21f532b&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:   scb-&amp;gt;cdb: 28 00 00 00 00 00 00 00 02 00&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:   lbolt_at_timeout: 137027154, lbolt_at_start: 137027154&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:   lsp-&amp;gt;state: 10d&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:  scratch_lsp: 000000004212ef00&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:  Pre-DSP script dump [ffffffff81fba030]:&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:   78347200 0000000a 78350800 00000000&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:   0e000004 81fba540 e0100004 81fba7c8&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:  Script dump [ffffffff81fba050]:&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:   870b0000 81fba2d8 98080000 00000005&lt;BR /&gt;Mar 14 11:25:17 igux-client vmunix:   721a0000 00000000 98080000 00000001&lt;BR /&gt;Mar 14 11:25:18 igux-client vmunix: SCSI: Resetting SCSI -- lbolt: 137030454, bus: 2&lt;BR /&gt;Mar 14 11:25:18 igux-client vmunix: SCSI: Reset detected -- lbolt: 137030454, bus: 2&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix: SCSI: Request Timeout -- lbolt: 137036186, dev: bc022000&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix:   lbp-&amp;gt;state: 60&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix:   lbp-&amp;gt;offset: ffffffff&lt;BR /&gt;Mar 14 11:26:15 
igux-client vmunix:   lbp-&amp;gt;uPhysScript: 81fba000&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix:  From most recent interrupt:&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix:   ISTAT: 22, SIST0: 00, SIST1: 04, DSTAT: 00, DSPS: 81fba540&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix:  lsp: 00000000491a2e00&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix:   bp-&amp;gt;b_dev: bc022000&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix:   scb-&amp;gt;io_id: 21f5360&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix:   scb-&amp;gt;cdb: 28 00 00 00 00 01 00 00 10 00&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix:   lbolt_at_timeout: 137033054, lbolt_at_start: 137033054&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix:   lsp-&amp;gt;state: 10d&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix:  lbp-&amp;gt;owner: 00000000491a2e00&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix:  scratch_lsp: 0000000000000000&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix:  Pre-DSP script dump [ffffffff81fba020]:&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix:   00000000 00000000 41020000 81fba290&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix:   78347e00 0000000a 78350800 00000000&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix:  Script dump [ffffffff81fba040]:&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix:   0e000004 81fba540 e0100004 81fba7f8&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix:   870b0000 81fba2d8 0a000000 81fba548&lt;BR /&gt;Mar 14 11:26:15 igux-client vmunix: SCSI: Abort abandoned -- lbolt: 137036186, dev: bc022000, io_id: 21f5360, status: 200&lt;BR /&gt;</description>
      <pubDate>Wed, 14 Mar 2007 11:04:56 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961622#M507843</guid>
      <dc:creator>Ralph Grothe</dc:creator>
      <dc:date>2007-03-14T11:04:56Z</dc:date>
    </item>
    <item>
      <title>Re: make_net_recovery dangling</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961623#M507844</link>
      <description>Glad you found it.&lt;BR /&gt;&lt;BR /&gt;Btw, as of 11.11 TCOE the scsictl command has an option to trigger the dreaded 'domain validation test'; maybe think about running it twice daily, as it should detect such errors. On the other hand, you could simply monitor the EMS event_log :)&lt;BR /&gt;&lt;BR /&gt;Have a nice evening.&lt;BR /&gt;</description>
      <pubDate>Wed, 14 Mar 2007 12:22:00 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961623#M507844</guid>
      <dc:creator>Florian Heigl (new acc)</dc:creator>
      <dc:date>2007-03-14T12:22:00Z</dc:date>
    </item>
    <item>
      <title>Re: make_net_recovery dangling</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961624#M507845</link>
      <description>Hi Florian,&lt;BR /&gt;&lt;BR /&gt;thanks for reminding me of the scsictl option.&lt;BR /&gt;I will see if I can rig up a passive Nagios service check that could run twice or thrice a day.&lt;BR /&gt;Usually we have the EMS agent enabled on all our HP boxes.&lt;BR /&gt;But for some strange reason just on this box it has been disabled ;-)&lt;BR /&gt;I'm sure that otherwise I would have been notified by email of the broken disk long before.&lt;BR /&gt;I have just filed an HW case via SCM and asked HP for a replacement disk.&lt;BR /&gt;</description>
      <pubDate>Thu, 15 Mar 2007 03:39:31 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961624#M507845</guid>
      <dc:creator>Ralph Grothe</dc:creator>
      <dc:date>2007-03-15T03:39:31Z</dc:date>
    </item>
    <item>
      <title>Re: make_net_recovery dangling</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961625#M507846</link>
      <description>Sadly, this disk type doesn't seem to support the domain validation test.&lt;BR /&gt;I performed it here on the undamaged root disk, which is of equal brand and model.&lt;BR /&gt;&lt;BR /&gt;# scsictl -c domain_val /dev/rdsk/c1t2d0&lt;BR /&gt;domain_val: option is valid for only Ultra160 and later controllers.&lt;BR /&gt;&lt;BR /&gt;# diskinfo /dev/rdsk/c1t2d0&lt;BR /&gt;SCSI describe of /dev/rdsk/c1t2d0:&lt;BR /&gt;             vendor: HP 73.4G&lt;BR /&gt;         product id: ST373454LC      &lt;BR /&gt;               type: direct access&lt;BR /&gt;               size: 71687369 Kbytes&lt;BR /&gt;   bytes per sector: 512&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;# scsictl -c get_bus_parms -c get_target_parms /dev/rdsk/c1t2d0&lt;BR /&gt;&lt;BR /&gt;BUS LIMITS&lt;BR /&gt;----------&lt;BR /&gt;flags:          0x0&lt;BR /&gt;width:          16 bits (8 = Narrow; 16 = Wide)&lt;BR /&gt;req/ack offset: 31&lt;BR /&gt;xfer rate:      20000000&lt;BR /&gt;SPEED:          40 MB/s (Ultra Wide)&lt;BR /&gt;&lt;BR /&gt;BUS PARMS&lt;BR /&gt;---------&lt;BR /&gt;flags:          0x0&lt;BR /&gt;width:          16 bits (8 = Narrow; 16 = Wide)&lt;BR /&gt;req/ack offset: 31&lt;BR /&gt;xfer rate:      20000000&lt;BR /&gt;SPEED:          40 MB/s (Ultra Wide)&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;TARGET LIMITS&lt;BR /&gt;-------------&lt;BR /&gt;flags:          0x0&lt;BR /&gt;width:          16 bits (8 = Narrow; 16 = Wide)&lt;BR /&gt;req/ack offset: 31&lt;BR /&gt;xfer rate:      20000000&lt;BR /&gt;SPEED:          40 MB/s (Ultra Wide)&lt;BR /&gt;&lt;BR /&gt;NEGOTIATED TARGET VALUES&lt;BR /&gt;------------------------&lt;BR /&gt;flags:          0x0&lt;BR /&gt;width:          16 bits (8 = Narrow; 16 = Wide)&lt;BR /&gt;req/ack offset: 31&lt;BR /&gt;xfer rate:      20000000&lt;BR /&gt;SPEED:          40 MB/s (Ultra Wide)&lt;BR /&gt;</description>
      <pubDate>Thu, 15 Mar 2007 03:49:20 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961625#M507846</guid>
      <dc:creator>Ralph Grothe</dc:creator>
      <dc:date>2007-03-15T03:49:20Z</dc:date>
    </item>
    <item>
      <title>Re: make_net_recovery dangling</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961626#M507847</link>
      <description>Ugh, in preparation for the replacement disk I now have a dangling lvreduce process which also holds a lock sentinel in /etc/lvmconf/lvm_lock.&lt;BR /&gt;Since a pvdisplay on the defective disk still reported its status, though as unavailable,&lt;BR /&gt;I considered this disk, according to the Cookbook, as still being "attached",&lt;BR /&gt;and went on audaciously issuing&lt;BR /&gt;&lt;BR /&gt;# vgdisplay -v vg00|awk '/LV Name/{print$NF}'|xargs -n1 -i lvreduce -m 0 -A n {} /dev/dsk/c2t2d0&lt;BR /&gt;&lt;BR /&gt;However, I am glad to notice that this runs into a timeout after a couple of minutes.&lt;BR /&gt;Looks as if I will have to issue the lvreduce commands separately, line by line...&lt;BR /&gt;</description>
      <pubDate>Thu, 15 Mar 2007 04:39:38 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961626#M507847</guid>
      <dc:creator>Ralph Grothe</dc:creator>
      <dc:date>2007-03-15T04:39:38Z</dc:date>
    </item>
    <item>
      <title>Re: make_net_recovery dangling</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961627#M507848</link>
      <description>Hi, &lt;BR /&gt;&lt;BR /&gt;the timeout is mostly the PV timeout, which defaults to 90 seconds, but it seems to grow with every LV you reduce.&lt;BR /&gt;&lt;BR /&gt;I've forgotten how to properly remove disks; I got lazy and just sit through the timeouts nowadays.&lt;BR /&gt;&lt;BR /&gt;- Search for that 'when good disks go bad' pdf by HP, and/or look for instructions for using lvreduce with the 'pv key'.&lt;BR /&gt;That way the disk will be silently wiped from the LV's config.&lt;BR /&gt;&lt;BR /&gt;- And be sure to use lvreduce -A n to avoid the autobackup call, which would query all disks. Instead, after reducing all LVs and removing the PV from the VG, do vgcfgbackup /dev/vg00&lt;BR /&gt;</description>
      <pubDate>Thu, 15 Mar 2007 10:15:12 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961627#M507848</guid>
      <dc:creator>Florian Heigl (new acc)</dc:creator>
      <dc:date>2007-03-15T10:15:12Z</dc:date>
    </item>
    <item>
      <title>Re: make_net_recovery dangling</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961628#M507849</link>
      <description>About scsictl: &lt;BR /&gt;&lt;BR /&gt;It must be the controller, not the disk.&lt;BR /&gt;Unfortunately:&lt;BR /&gt;a) scsictl -c domain is not in HP-UX MC/EOE builds, so we don't have it at work and I can't give you a list of controllers supporting it&lt;BR /&gt;b) not all (e.g. onboard) controllers are accompanied by EMS resource agents. I filed an enhancement request on this last year, but with no success.&lt;BR /&gt;&lt;BR /&gt;Maybe that's why you didn't get a notification.</description>
      <pubDate>Thu, 15 Mar 2007 21:08:01 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961628#M507849</guid>
      <dc:creator>Florian Heigl (new acc)</dc:creator>
      <dc:date>2007-03-15T21:08:01Z</dc:date>
    </item>
    <item>
      <title>Re: make_net_recovery dangling</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961629#M507850</link>
      <description>&amp;gt; About scsictl:&lt;BR /&gt;&amp;gt; it must be the controller, not the disk. &lt;BR /&gt;&lt;BR /&gt;I can't see how this scsictl command should be applied to anything other than a disk device file.&lt;BR /&gt;&lt;BR /&gt;I can't figure out a device file for the controller, and passing scsictl the HW path doesn't please it.&lt;BR /&gt;&lt;BR /&gt;e.g.&lt;BR /&gt;&lt;BR /&gt;# ioscan -knfCext_bus&lt;BR /&gt;Class     I  H/W Path  Driver S/W State   H/W Type     Description&lt;BR /&gt;=================================================================&lt;BR /&gt;ext_bus   0  0/0/1/0   c720 CLAIMED     INTERFACE    SCSI C896 Ultra Wide Single-Ended&lt;BR /&gt;ext_bus   1  0/0/1/1   c720 CLAIMED     INTERFACE    SCSI C896 Ultra Wide Single-Ended&lt;BR /&gt;ext_bus   2  0/0/2/0   c720 CLAIMED     INTERFACE    SCSI C87x Ultra Wide Single-Ended&lt;BR /&gt;ext_bus   3  0/0/2/1   c720 CLAIMED     INTERFACE    SCSI C87x Ultra Wide Single-Ended&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;# scsictl -c domain_val 0/0/2/0          &lt;BR /&gt;scsictl: Can't open device 0/0/2/0.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;# ioscan -knfH0/0/2/0&lt;BR /&gt;Class     I  H/W Path     Driver S/W State   H/W Type     Description&lt;BR /&gt;=====================================================================&lt;BR /&gt;ext_bus   2  0/0/2/0      c720  CLAIMED     INTERFACE    SCSI C87x Ultra Wide Single-Ended&lt;BR /&gt;target    5  0/0/2/0.0    tgt   CLAIMED     DEVICE       &lt;BR /&gt;disk      3  0/0/2/0.0.0  sdisk CLAIMED     DEVICE       SEAGATE ST318404LC&lt;BR /&gt;                         /dev/dsk/c2t0d0   /dev/rdsk/c2t0d0&lt;BR /&gt;target    6  0/0/2/0.2    tgt   CLAIMED     DEVICE       &lt;BR /&gt;disk      4  0/0/2/0.2.0  sdisk CLAIMED     DEVICE       HP 73.4GST373454LC&lt;BR /&gt;                         /dev/dsk/c2t2d0   /dev/rdsk/c2t2d0&lt;BR /&gt;target    7  0/0/2/0.7    tgt   CLAIMED     DEVICE       &lt;BR /&gt;ctl       2  0/0/2/0.7.0  sctl  CLAIMED     DEVICE       Initiator&lt;BR /&gt;                         /dev/rscsi/c2t7d0&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;OK, the Initiator has a device file and sounds plausible to me.&lt;BR /&gt;Unfortunately scsictl cannot communicate over it.&lt;BR /&gt;&lt;BR /&gt;# scsictl -c domain_val /dev/rscsi/c2t7d0&lt;BR /&gt;scsictl: Can't open device /dev/rscsi/c2t7d0.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Btw, the disk has been replaced by now.&lt;BR /&gt;And the original make_*_recovery passed as usual without any errors.&lt;BR /&gt;</description>
      <pubDate>Fri, 16 Mar 2007 05:52:44 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961629#M507850</guid>
      <dc:creator>Ralph Grothe</dc:creator>
      <dc:date>2007-03-16T05:52:44Z</dc:date>
    </item>
    <item>
      <title>Re: make_net_recovery dangling</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961630#M507851</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;the domain validation test is controller-initiated, but invoked by pointing at a disk :)&lt;BR /&gt;&lt;BR /&gt;The SCSI UW controllers might actually be too old; I only know about the domain validation test from the errors we sometimes saw, and that was on C1010 controllers and beyond (LSI-based dual-channel U3W SCSI).</description>
      <pubDate>Fri, 16 Mar 2007 07:52:02 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/make-net-recovery-dangling/m-p/3961630#M507851</guid>
      <dc:creator>Florian Heigl (new acc)</dc:creator>
      <dc:date>2007-03-16T07:52:02Z</dc:date>
    </item>
  </channel>
</rss>