FYI: can't boot after Business Copy undo problem

Environment:
- Trucluster 5.1B PK3
- Business Copy Server v2.3 build 86 on SMA appliance
- BC Tru64 Agent v2.3.0.0

The job creates a snapshot of a disk and mounts it on the same cluster node (node1) for a tape backup.

After a failed undo, the snap didn't get deleted and I got AdvFS panics when accessing it.
We got I/O errors and couldn't umount the snap.
After hwmgr scan scsi, the dskxxx device was deleted (at least it didn't show up anymore).
We tried to replace the underlying dsk device of the snap domain on node1. The other node (node2) couldn't see the snap volume, but still had some reference to it.

To fix the issue I decided to reboot node1 ...
clsm: initialized
Waiting for cluster mount to complete
vm_swap_init: swap is set to lazy (over commitment) mode
PowerTermCMS: Joining deferred filesystem sets

trap: invalid memory read access from kernel mode

faulting virtual address: 0xfffffe04fef425b0
pc of faulting instruction: 0xffffffff0005cc80
ra contents at time of fault: 0xffffffff0005cc80
sp contents at time of fault: 0xfffffe072423f8c0

panic (cpu 0): kernel memory fault
syncing disks... done
drd: Clean Shutdown


DUMP: blocks available: 33028096

DUMP: blocks wanted: 759106 (partial compressed dump) [OKAY]

===================
Can't boot again.

Solution:
===================
From node2:
Node2 doesn't see any mounted snap, but:
#cfsmgr -u -d BCV-domain
"umounts" it and deletes any stale reference to it.
After that I deleted the
/etc/fdmns/BCV-domain file

Now I can boot node1 successfully.
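
For anyone who wants to script the two steps above, here is a rough sketch, not a tested procedure. The run helper, the DRY_RUN switch and the function name are my own additions for illustration; only the cfsmgr -u -d call and the /etc/fdmns/BCV-domain path come from the post. It uses rm -r on the assumption that the /etc/fdmns entry may be a directory of volume links, as is usual for AdvFS domains. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch only: run from the surviving member (node2 in the post).
# Prints the commands unless DRY_RUN=0 is set in the environment.
DOMAIN="${DOMAIN:-BCV-domain}"   # stale snap domain name from the post
DRY_RUN="${DRY_RUN:-1}"          # hypothetical safety switch, not part of any tool

# Hypothetical helper: echo the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

cleanup_stale_domain() {
    # Step 1: drop the stale cluster reference to the domain (command from the post).
    run cfsmgr -u -d "$DOMAIN"
    # Step 2: remove the domain entry under /etc/fdmns.
    # Assumption: rm -r, so it works whether the entry is a file or a directory.
    run rm -r "/etc/fdmns/$DOMAIN"
}

cleanup_stale_domain
```

Check the dry-run output first, then rerun with DRY_RUN=0 once you are sure the domain name is the stale snap and not a production domain.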

The BC agent for Tru64 has been a nightmare since the first day. We will move to RSM very soon and hope things improve ...

Any comments about RSM HA for Tru64?

regards
antonio