Reason for Package failure ?

Clive Nicholas — Sat, 30 Aug 2008 22:45:36 GMT

I have two HP9000 servers running HP-UX 11.00.
They are connected together to form a two-node cluster. The servers have three shared volume groups, each containing two mirrored disks. The shared volume groups are activated in exclusive mode by whichever server is running the single package at the time.
The problem is that two disks in these shared volume groups were recently replaced (the failed disks were in different volume groups). In each case the replacement disk was mirrored from the remaining good disk in its volume group.
If I manually activate the shared volume groups on each node in turn and view them and their logical volumes in SAM, everything looks good.
However, when the servers are now restarted, the cluster forms OK and the package appears to start OK, but then the node reboots with the message "A crucial package has failed" and the package attempts to move to the other node. The package then fails in exactly the same way on that node.
I have looked at the syslog.log and package.cntl.log files, but I cannot see a reason for the package failure.
Is there somewhere else I should be looking in order to determine the reason for the package failure on both nodes?

Re: Reason for Package failure ?

Deepak Kr — Sat, 30 Aug 2008 23:16:34 GMT

Clive,

What do you mean by the following?

>>package appears to start OK >>

Can you see the package in running status in the output of

cmviewcl -vp pkgname
or
cmviewcl -v
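
Because the node halts within seconds of the package starting, a single cmviewcl call is easy to miss. The following is a minimal sketch of a polling loop, on the assumption that it is run from the other cluster node so that the output file survives the halt. The function name snapshot_pkg and the CMVIEWCL and INTERVAL variables are invented for this illustration; only the cmviewcl call itself comes from the commands above.

```shell
# snapshot_pkg PKG OUTFILE [COUNT]
# Append a timestamp and the package status to OUTFILE, COUNT times
# (default 30), pausing INTERVAL seconds (default 1) between samples.
# CMVIEWCL can be overridden for testing; it defaults to cmviewcl on PATH.
snapshot_pkg() {
    pkg=$1
    out=$2
    count=${3:-30}
    i=0
    while [ "$i" -lt "$count" ]; do
        date >> "$out"
        ${CMVIEWCL:-cmviewcl} -v -p "$pkg" >> "$out" 2>&1
        i=$((i + 1))
        sleep "${INTERVAL:-1}"
    done
}

# Example (start it on the surviving node just before cmrunpkg):
#   snapshot_pkg package /tmp/package.status 60
```

The last sample written before the halt shows which service or subnet changed state first, if any did.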

Re: Reason for Package failure ?

Deepak Kr — Sun, 31 Aug 2008 00:00:56 GMT

Please also provide the version of Serviceguard you are running on these nodes.

#swlist |grep -i serviceguard
also
#what /usr/lbin/cmcld

Re: Reason for Package failure ?

Deepak Kr — Sun, 31 Aug 2008 00:12:16 GMT

Have you changed the disk that was in the cluster lock VG?

Re: Reason for Package failure ?

Clive Nicholas — Sun, 31 Aug 2008 08:43:16 GMT

Thanks for your responses and help KumarD.
With both nodes up in the cluster, I can manually start the package (it is named "package") on a node by entering:
# cmrunpkg package
The response is "cmrunpkg completed successfully on all packages specified".
If I quickly enter "cmviewcl -vp package" at this point, I see the following response:

PACKAGE   STATUS   STATE     PKG_SWITCH   NODE
package   up       running   disabled     kwamc0s

Policy Parameters
POLICY NAME   CONFIGURED_VALUE
Failover      configured_node
Failback      manual

Script_Parameters
ITEM      STATUS   MAXRESTARTS   RESTARTS   NAME
Service   up       0             0          cmsmgr
Service   up       0             0          ovdm
Service   up       0             0          tmn
Service   up       0             0          oracle
Service   up       0             0          neos
Service   up       0             0          lms
Service   up       0             0          shut
Subnet    up                                128.4.0.0
Subnet    up                                192.168.1.0

Node_Switching_Parameters
NODE_TYPE   STATUS   SWITCHING   NAME
Primary     up       enabled     kwamc0s (current)
Alternate   up       enabled     kwamc1s

So at this point everything looks OK to me (neos and lms are our two custom application programs) and the package appears to be up and running.
But after a few seconds this message appears:
kwamc0s cmcld: Halting kwamc0s to preserve data integrity
Reason: A crucial package failed.

Re: Reason for Package failure ?

Clive Nicholas — Sun, 31 Aug 2008 08:53:42 GMT

#swlist |grep -i serviceguard reports the following:
PHSS_17581 1.0 MC ServiceGuard 11.05 Cumulative Patch

#what /usr/lbin/cmcld reports the following:
HP92453-02A.10.20 HP_UX SYMBOLIC DEBUGGER (END.0) $Revision 7403 $
Build Date: Wed mar 3 14:17:00 PST 1999
Build id: ibld_sg_a1105_patch
A.100.05 Date 99/02/22 PHSS_17581 (SG English/Japanese) PHSS_17483(LM English) PHSS_17484 (LM Japanese) Date: 99/01/13 PHSS_17230
Daemon
Config DB
Cluster Monitor
Command Srv
CommunicationSrv
Config
Dlm
Local Comm
Network Sensor
Package Manager
Remote Comm
API
Service Sensor
Cluster LVM
Status DB
Sync
Util
A.01.01 Resource Monitor API (11_00_AR: Oct 17 1997 09:24:32)

Re: Reason for Package failure ?

Steven E. Protter — Sun, 31 Aug 2008 09:01:42 GMT

Shalom,

All the software in this installation is out of date and out of support.

That being said, the package seems to be shut down due to concerns about data integrity on shared storage.

You need to check all lock disks and shared disks configured in the cluster and package configurations for trouble.

Start with dmesg, then perhaps exercise the disks with mstm/cstm/xstm (your choice) to find the root of the problem.

It would not hurt to plan to bring this cluster back into the world of supported OS and system software, and perhaps gain help from your HP software service contract.

SEP
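
Since the replaced disks were re-mirrored, one quick check of the shared disks is to look for stale or unavailable entries in the LVM status while a volume group is manually activated. The sketch below filters the output of vgdisplay -v, a standard HP-UX LVM command that is not mentioned elsewhere in this thread. The function name lvm_problems is made up for the example, and the status strings it matches ("stale", "unavailable") are the usual ones but should be checked against what your system actually prints.

```shell
# lvm_problems: read `vgdisplay -v` output on standard input and print
# every LV or PV status line that reports stale or unavailable.
# Prints nothing (and returns non-zero) when no such line is found.
lvm_problems() {
    grep -E '(LV|PV) Status' | grep -E 'stale|unavailable'
}

# Example, with the volume group activated by hand:
#   vgdisplay -v /dev/vg04 | lvm_problems
```

No output means every logical volume and physical volume status line looked clean; it does not prove the disks themselves are healthy, so dmesg and the diagnostic tools are still worth running.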

Re: Reason for Package failure ?

Clive Nicholas — Sun, 31 Aug 2008 09:11:56 GMT

Yes, I believe the disk that contained the cluster lock was one of those changed. (Note that the servers aren't actually at my site; I am helping out a colleague by logging in remotely over the network to the servers in the Netherlands.)
However, I had wondered about the cluster lock before, and I think we performed the correct procedure to ensure that it was recreated correctly.
What I did was this:
cmhaltcl
vgchange -c n vg04
vgchange -c n vg05
vgchange -a y vg04
vgchange -a y vg05
cd /etc/cmcluster

I then ran the cmapplyconf command to recompile and distribute the package, before deactivating the volume groups and starting the cluster again.
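
For reference, the sequence described above can be written down as a small script. This is only a sketch of those steps, not a verified procedure: the post does not say which cmapplyconf options were used, so the -v -C form and the configuration file argument are assumptions, as is the use of cmruncl to start the cluster again. The run and reapply_cluster_config names are invented for the example, and by default the script only prints the commands instead of running them.

```shell
# run CMD...: print the command, or execute it when DRYRUN=0.
# Printing is the default so that nothing is changed by accident.
run() {
    if [ "${DRYRUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# reapply_cluster_config CONFIG VG...
# CONFIG is the cluster ASCII configuration file (site-specific).
reapply_cluster_config() {
    conf=$1
    shift
    run cmhaltcl || return 1
    for vg in "$@"; do
        run vgchange -c n "$vg" || return 1   # clear the cluster flag
        run vgchange -a y "$vg" || return 1   # activate in normal mode
    done
    run cmapplyconf -v -C "$conf" || return 1
    for vg in "$@"; do
        run vgchange -a n "$vg" || return 1   # deactivate before restart
    done
    run cmruncl
}

# Example (prints the nine commands without running them):
#   reapply_cluster_config /etc/cmcluster/<cluster config file> vg04 vg05
```

Reviewing the printed list first makes it easier to confirm that the volume group holding the cluster lock is actually among those activated before cmapplyconf runs.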

So, I believe the clusterlocks are OK, but is there a way to check this for sure?

Note: I have logged into kwamc0s remotely on a separate console and used the command
#tail -f /var/adm/syslog/syslog.log
to follow events in this log when the package is started on kwamc0s.
I see the line
cmcld: Service PKG*3841 terminated due to an exit(0)

However, this line appears immediately BEFORE the line shown below with the same timestamp which says:
cmcld: Started package package on node kwamc0s
I've no idea why the package is failing and rebooting the node.
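
A sketch of two filters that may help narrow this down. The first pulls the cmcld service, package and halt messages out of a syslog file so they can be read in order. The second lists the fail-fast settings in a package ASCII configuration file, on the assumption that this Serviceguard release uses the NODE_FAIL_FAST_ENABLED and SERVICE_FAIL_FAST_ENABLED parameters; if either is set to YES, a failing package or service would be expected to halt the node rather than just stop the package. Both function names are invented for the example, and the package configuration file path is site-specific.

```shell
# cmcld_events [LOGFILE]
# Print cmcld messages that mention services, packages or a node halt.
# Defaults to the syslog path used in this thread.
cmcld_events() {
    log=${1:-/var/adm/syslog/syslog.log}
    grep 'cmcld:' "$log" | grep -E 'Service|package|Halting'
}

# fail_fast_settings PKGCONF
# Print the uncommented fail-fast lines from a package configuration file.
fail_fast_settings() {
    grep -E '^[[:space:]]*(NODE|SERVICE)_FAIL_FAST_ENABLED' "$1"
}

# Example:
#   cmcld_events
#   fail_fast_settings /etc/cmcluster/<package config file>
```

If a service is reported as terminated shortly before the halt message, that service's own log is the next place to look.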

Re: Reason for Package failure ?

Clive Nicholas — Sun, 31 Aug 2008 09:26:09 GMT

Thanks for your help Steven .. I will try what you suggest to check the shared disks.
I think the problem with support is that the applications being run use HP OpenView DM TMN, and support for this has been discontinued by HP. The servers help supervise an older submarine cable system and there is no means of updating the application software, so the OS has not been touched for many years and we would not be permitted to upgrade it (by managers on high!). Systems of this type are installed and commissioned, and after the initial commissioning/proving period we are not permitted to upgrade or patch them further unless there is an exceptional requirement and proof that the work will not affect the current custom-built application software.

Re: Reason for Package failure ?

Deepak Kr — Sun, 31 Aug 2008 12:09:29 GMT

Clive, try the following regarding the cluster lock disk:

Run
# cmapplyconf -v -C /etc/cmcluster/cluster-configfilename

and then start package.

I guess this could be due to missing lock information on the newly replaced disk.

or

If a previous backup of the lock VG configuration exists, then

#vgcfgrestore -n /dev/lock-vg-name disk-device-name
For example:

#vgcfgrestore -n /dev/vg_lock /dev/dsk/c4t6d0

Yes, it is absolutely true that the Serviceguard version you are running is no longer supported by HP.
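
Before running vgcfgrestore it is worth confirming that a backup file for the lock volume group actually exists. A minimal sketch, assuming the backup was written by vgcfgbackup to its default location of /etc/lvmconf/<vgname>.conf. The function name lock_vg_backup and the LVMCONF_DIR override are invented for this example.

```shell
# lock_vg_backup VGNAME
# Print the expected vgcfgbackup file for VGNAME and return success
# only if that file exists and is readable.
# Accepts either "vg_lock" or "/dev/vg_lock".
lock_vg_backup() {
    vg=${1#/dev/}
    file=${LVMCONF_DIR:-/etc/lvmconf}/$vg.conf
    echo "$file"
    [ -r "$file" ]
}

# Example, using the names from the post above:
#   lock_vg_backup /dev/vg_lock && vgcfgrestore -n /dev/vg_lock /dev/dsk/c4t6d0
```

If the function fails, there is no default backup to restore from, and re-applying the cluster configuration is the remaining option of the two suggested above.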