Operating System - Tru64 Unix

Re: ES40 won't boot properly after system restore

 
SOLVED
Nick Bishop (Kiwi)
Frequent Advisor

ES40 won't boot properly after system restore

We had some disk failures on our ES40. We've put in replacement disks, partitioned the disk array, and restored from backups.

We did the restore while booted to a Tru64 CDROM.

Now, when we boot single-user mode to the actual array, several strange things happen.

1. Get a message during boot saying vm_swap_init: Unexpected swapon for swap device /dev/disk/dsk0b

2. hwmgr -view device
gives no output (no disk, no tape, nothing)

3. disklabel -r dsk0
Error message: No such device or address

If I look into /dev/disk, I see the special device files for dsk0{a,b,c,d,e,f,g,h}

If I boot onto CDROM, the disk appears perfectly normal as dsk0 (and that's how it was before the failure).

Did I do something wrong during the root filesystem restore?

System details:
ES40, running Tru64 5.1B-4. The system has some patches (but obviously the CD does not). The SRM console is version 7.2-1, and the NHD-7 CD has been applied.

The "disk" is actually a 6*72GB RAID5 array on an HP Smart Array 5300A controller (v3.56). One disk failed, then a second disk failed before the first replacement arrived.

P00>>> show config
Slot Option Hose 0, Bus 0, PCI
3 HP Smart Array 5300A pya0.0.0.3.0
dya0.0.0.3.0
(and more)
P00>>> show device
dya0.0.0.3.0 DYA0 CPQCISS
pya0.0.0.3.0 PYA0
(and more)

I haven't attempted multi-user (I think the above is more than enough problems).

I've got certain types of hwmgr output saved from before the failure. Let me know if it's needed and I'll extract it. However, I do not have the disklabel -r output from before the failure: I had to guesstimate the partition sizes during the restore.
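As a side note, guessing partition sizes could be avoided next time by archiving the device state on a schedule. A minimal sketch of such a helper is below; the function name, destination path, and file-naming scheme are my own assumptions, not anything from this system, and the `hwmgr`/`disklabel` commands exist only on Tru64, so failures are tolerated rather than fatal:

```shell
#!/bin/sh
# Illustrative sketch: snapshot device/partition state so a future
# restore doesn't have to guess sizes. Function name, destination
# path, and file names are assumptions for the example.
snapshot_hw_state() {
    dest=${1:-/var/adm/hwsnapshot}
    mkdir -p "$dest" || return 1
    for cmd in "hwmgr -view devices" "hwmgr -show scsi -full" "disklabel -r dsk0"; do
        # Derive a file name from the command text (spaces, '/', '-' become '_')
        out="$dest/$(echo "$cmd" | tr ' /-' '___').txt"
        # Capture output; warn (but continue) if a command is unavailable
        $cmd > "$out" 2>&1 || echo "warning: '$cmd' failed" >&2
    done
}
```

Something like `snapshot_hw_state /var/adm/hwsnapshot` from a weekly cron entry would have kept the disklabel output at hand.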

Ideas?
8 REPLIES
Nick Bishop (Kiwi)
Frequent Advisor

Re: ES40 won't boot properly after system restore

Certain output repeated with "retain formatting". It still looks a bit useless, so you'll have to drag it into Notepad or gedit to see the real McCoy.

P00>>> show config
Slot Option Hose 0, Bus 0, PCI
3 HP Smart Array 5300A pya0.0.0.3.0
dya0.0.0.3.0
(and more)
P00>>> show device
dya0.0.0.3.0 DYA0 CPQCISS
pya0.0.0.3.0 PYA0
(and more)
Martin Moore
HPE Pro

Re: ES40 won't boot properly after system restore

What has probably happened is that the replacement disks show up under different device names. Tru64 UNIX version 5 bases disk device names on worldwide IDs (WWIDs), and if these change (which will almost certainly happen when you replace a disk), the OS will recognize that the disk isn't the same one it knew before.

Fortunately, this usually isn't too difficult to fix. A good starting point would be to post the output of "hwmgr -show scsi" from before and after the replacement, if you have both.
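That before/after comparison can also be done mechanically. A sketch, assuming each capture has been reduced to its WWID lines (e.g. with `grep '^WWID:'`); the sample values below are the ones that turned out to be involved in this thread, stood in by here-docs:

```shell
# Saved WWID lines from before/after `hwmgr -show scsi -full` captures.
# In practice these files would be the saved outputs filtered through
# grep '^WWID:'; the values here are samples from this thread.
cat > wwid_before.txt <<'EOF'
WWID: 01000010:6005-08b1-0010-4344-4150-5331-474e-0002
EOF
cat > wwid_after.txt <<'EOF'
WWID: 01000010:6005-08b1-0010-4344-4150-5331-474e-0003
EOF
# Print any WWID present now that was not present before the disk swap
grep -F -x -v -f wwid_before.txt wwid_after.txt
```

Any line this prints is a WWID the OS has never seen, i.e. a device it will treat as brand new.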

Martin
I work for HPE
A quick resolution to technical issues for your HPE products is just a click away HPE Support Center
See Self Help Post for more details


Nick Bishop (Kiwi)
Frequent Advisor

Re: ES40 won't boot properly after system restore

Hi Martin,

I'll try hwmgr -show scsi when I get back on Monday morning (approx 22:30 Sun, GMT), but "hwmgr -view device" giving no output doesn't bode well. I'll extract the hwmgr output from before the failure as well.

Do you think I should ...
# hwmgr -scan scsi
first?

Bear in mind the root filesystem gets left read-only, and that /usr, /var, and /tmp are all separate filesets, in my setup.
Martin Moore
HPE Pro
Solution

Re: ES40 won't boot properly after system restore

> Bear in mind the root filesystem gets left read-only, and that /usr, /var, and /tmp are all separate filesets, in my setup.

That's exactly what I would expect to happen in a device-name change scenario. Root gets initially mounted (read-only) because that's where the kernel is booting from. But when the system tries to re-mount root read/write, it fails because it's on the "wrong" device. (It's likely that if you look at the output of "mount" after trying to boot, it will show root mounted on root_device rather than root_domain.) Then the other filesystems fail to mount because they're also on the wrong devices.
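A quick way to spot that state from single-user mode is to look for root sitting on root_device instead of root_domain in the `mount` output. The sample line below is hypothetical (the exact format on a given system may differ); the point is only the grep:

```shell
# Hypothetical sample of `mount` output after a failed read/write
# remount of root; the real format may differ slightly.
cat > mount_out.txt <<'EOF'
root_device on / type advfs (rw)
EOF
# If root shows as root_device rather than root_domain, the device
# names under root have likely changed out from under the kernel.
if grep -q '^root_device ' mount_out.txt; then
    echo "root is on root_device - device names likely changed"
fi
```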

I seem to recall that SA5300 devices might not show up in the output of hwmgr -show scsi, although I'm not 100% sure of that and can't check at the moment. Can you post the output of hwmgr -view dev (1) before the change, (2) while booted from CD, and (3) (if it works) after trying to boot from the replacement disk?

Nick Bishop (Kiwi)
Frequent Advisor

Re: ES40 won't boot properly after system restore

Thank you very much, Martin.

The hwmgr -show scsi gave me a clue, and hwmgr -show scsi -full indeed confirmed the WWIDs had changed compared to before.

Some trickery with hwmgr -delete and dsfmgr solved the problems. Details will follow in the next day or so (I'm still madly pressing the server into service).

Nick.
Nick Bishop (Kiwi)
Frequent Advisor

Re: ES40 won't boot properly after system restore

OK, here is the approximate flow of how I solved the problem.

I've tagged the command inputs and outputs below like this:
now#
(when the system is booted on restored disk)

before#
(saved output from before the failure)

CD#
(when the system is booted on CD)

>>>
(at the console prompt: more precisely a P00>>> prompt)

###
(Explanatory comments, not part of the input or output)

Step 1
======
I had located a Compaq patch advisory for I2O disks that suggested the following procedure. I wrote it down, but did not execute it at this stage.

>>> boot -fl s
# mountroot
# /sbin/hwmgr -view devices
HWID: Device Name
58: /dev/disk/dsk5c
### Note bogus name dsk5

# cat /cluster/members/(member)/etc/i2oNameData.log
25: iop-0-tid-514: dsk0
### Note former HWID = 25

# hwmgr -delete component -id 25
# /sbin/hwmgr -R hwid 25
# /sbin/dsfmgr -m dsk5 dsk0
### renames (-m)oves disk
# shutdown -h now

Note I did not execute this procedure, but simply noted it.


Step 2
======
### Execute as suggested by Martin
now# hwmgr -show scsi
HWID Scsi host type subtyp owner #path dev 1st_path
69: 0 (none) disk none 0 1 (null)
-1: 4 (none) disk none 2 1 (null) [1/0/0]

### Digging into the previous (saved) output, it so happened I did have the output from that same command ...
before# hwmgr -show scsi
69: 0 revan disk none 2 1 dsk0 [1/0/0]

### As an additional step I got -full output
now# hwmgr -show scsi -full
69: 0 (none) disk none 0 1 (null)

WWID: 01000010:6005-08b1-0010-4344-4150-5331-474e-0002

Bus Target Lun Path Status
1 0 0 Stale

-1: 4 (none) disk none 2 1 (null) [1/0/0]

WWID: 01000010:6005-08b1-0010-4344-4150-5331-474e-0003
### Note the last digit differs on the "new" device

Bus Target Lun Path Status
1 0 0 Valid

### I also had that same command output saved away.
before# hwmgr -show scsi -full
69: 0 (none) disk none 0 1 dsk0 [1/0/0]

WWID: 01000010:6005-08b1-0010-4344-4150-5331-474e-0002

Bus Target Lun Path Status
1 0 0 Valid

### For curiosity, on the CD
CD# hwmgr -show scsi
69: 0 revan disk none 2 1 dsk0 [1/0/0]

CD# hwmgr -show scsi -full
69: 0 (none) disk none 0 1 dsk0 [1/0/0]

WWID: 01000010:6005-08b1-0010-4344-4150-5331-474e-0003

Bus Target Lun Path Status
1 0 0 Valid

### So I see the HW database has the old WWID recorded against B/T/L 1/0/0 (Stale), and it sees a new WWID against the same B/T/L.
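The same conclusion can be pulled out mechanically. A sketch that scans a saved `-full` listing (condensed from the output above) and flags any HWID that has a stale path; the output wording is my own:

```shell
# Saved `hwmgr -show scsi -full` listing, condensed from the output above
cat > scsi_full.txt <<'EOF'
69:   0   (none)  disk  none  0  1  (null)
WWID: 01000010:6005-08b1-0010-4344-4150-5331-474e-0002
1  0  0  Stale
-1:   4   (none)  disk  none  2  1  (null)  [1/0/0]
WWID: 01000010:6005-08b1-0010-4344-4150-5331-474e-0003
1  0  0  Valid
EOF
# Remember the most recent HWID line; report it when a path shows Stale
awk '/^ *-?[0-9]+:/ { hwid = $1; sub(/:$/, "", hwid) } /Stale$/ { print "stale path on HWID " hwid }' scsi_full.txt
```

Here it reports HWID 69, the entry whose recorded WWID no longer matches anything on the bus.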

Step 3
======
### A variation of the procedure described in Step 1

>>> boot -fl s
### note swap error msg
now# mountroot
### get output for creation of extra device files for dsk1

now# hwmgr -show scsi
HWID Scsi host type subtyp owner #path dev 1st_path
69: 0 (none) disk none 0 1 (null)
91: 4 (none) disk none 2 1 (null) [1/0/0]
### OK, my disk now has HWID 91

now# hwmgr -delete component -id 69
now# dsfmgr -R hwid 69
now# dsfmgr -m dsk1 dsk0
### output for all the partition device names being renamed
now# shutdown -h now

>>> boot -fl s
### note no swap error this time
now# disklabel -r dsk0
### Good output, listing partitions a-h.

now# shutdown -r -s now
### wait for reboot
### See all other filesystems being mounted
### Problem solved, but see Step 4.
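If this sequence ever has to be repeated, the Step 3 commands could be wrapped in a small dry-run helper, purely illustrative (the function is my own invention): it only echoes the commands so the HWID and device names can be sanity-checked before anything destructive runs.

```shell
# Illustrative dry-run wrapper around the Step 3 rename sequence.
# It echoes the commands instead of executing them, so the stale
# HWID and the old/new device names can be reviewed first.
rename_disk() {
    stale_hwid=$1
    old_name=$2
    new_name=$3
    echo "hwmgr -delete component -id $stale_hwid"
    echo "dsfmgr -R hwid $stale_hwid"
    echo "dsfmgr -m $old_name $new_name"
}
rename_disk 69 dsk1 dsk0
```

Piping the echoed lines through `sh` (after review) would then perform the actual rename.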

Step 4
======
Not entirely sure if it's relevant, but I found that I got an invitation to do post-install "configuration", my X didn't come up, and most of my /var filesystem was missing. A quick reboot to single-user mode, mounting that fileset on a temporary mount point, and restoring it from backup again fixed that.

After a reboot, the OS was fully functional.

Nick.
Nick Bishop (Kiwi)
Frequent Advisor

Re: ES40 won't boot properly after system restore

Martin:
> It's likely that if you look at the output of "mount" after trying to boot, it will
> show root mounted on root_device rather than root_domain

That was exactly true, although I find that's also normal when you've just booted single-user.

The difference: if you attempt
# mount -u /

it is silent if all is well, but for me it complained of mismatched device names until I had fixed the device name-change situation.
Nick Bishop (Kiwi)
Frequent Advisor

Re: ES40 won't boot properly after system restore

Some trickery with hwmgr and dsfmgr was required (described above). The server has been running a week now without missing a beat.

Nick.