Re: Reason for Package failure ?

Clive Nicholas · ‎08-30-2008

I have two HP9000 servers running HP-UX 11.00.
They are connected together to form a two-node cluster. The servers have 3 shared volume groups, with two mirrored disks in each volume group. These shared disks are activated in exclusive mode by whichever server is running the single package at any one time.
The problem I have is that two disks in these shared volume groups were recently replaced (each failed disk was in a different volume group). In each case the replacement disk was mirrored from the remaining good disk of the volume group.
If I manually activate the shared volumes on each of the nodes in turn and view them and their logical volumes in SAM everything looks good.
However, now when the servers are restarted, the cluster forms OK and the package appears to start OK, but then the node reboots with the message "A crucial package has failed" and the package attempts to move to the other node. The package then fails in exactly the same way on the new node.
I have looked at the syslog.log file and the package.cntl.log files, but I cannot see a reason for the package failure.
Is there somewhere I should be looking in order to determine the reason for the package failure on both nodes?

Deepak Kr · ‎08-30-2008

Clive,

What do you mean by the following?

>>package appears to start OK >>

Are you able to see the package status as running in

cmviewcl -vp pkgname
or
cmviewcl -v

"There is always some scope for improvement"

Deepak Kr · ‎08-30-2008

Also, please provide the version of Serviceguard you are using on these nodes.

# swlist | grep -i serviceguard
also
# what /usr/lbin/cmcld

"There is always some scope for improvement"

Deepak Kr · ‎08-30-2008

Have you changed the disk that was in the cluster lock VG?

"There is always some scope for improvement"

Clive Nicholas · ‎08-31-2008

Thanks for your responses and help, KumarD.
With the cluster up on both nodes, I can manually start the package (the package is named "package") on a node by entering:
# cmrunpkg package
The response is "cmrunpkg completed successfully on all packages specified"
If I quickly enter "cmviewcl -vp package" at this time I see the following response

PACKAGE      STATUS       STATE        PKG_SWITCH   NODE
package      up           running      disabled     kwamc0s

Policy_Parameters
POLICY_NAME     CONFIGURED_VALUE
Failover        configured_node
Failback        manual

Script_Parameters
ITEM       STATUS   MAXRESTARTS   RESTARTS   NAME
Service    up       0             0          cmsmgr
Service    up       0             0          ovdm
Service    up       0             0          tmn
Service    up       0             0          oracle
Service    up       0             0          neos
Service    up       0             0          lms
Service    up       0             0          shut
Subnet     up                                128.4.0.0
Subnet     up                                192.168.1.0

Node_Switching_Parameters
NODE_TYPE    STATUS       SWITCHING    NAME
Primary      up           enabled      kwamc0s (current)
Alternate    up           enabled      kwamc1s

So at this point everything looks OK (neos and lms are our two custom application programs) and the package looks to be up and running to me.
But after a few seconds the message appears:
kwamc0s cmcld: Halting kwamc0s to preserve data integrity
Reason: A crucial package failed.
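
Since the package only stays up for a few seconds, it may help to capture the service states repeatedly instead of typing cmviewcl by hand. The following is only a minimal sketch, assuming the output keeps the whitespace-separated layout shown above; the function name services_not_up is hypothetical and is not part of Serviceguard.

```shell
#!/bin/sh
# Hypothetical helper (not a Serviceguard command): reads cmviewcl -v style
# output on stdin and prints the NAME of every "Service" line whose STATUS
# column is anything other than "up".
services_not_up() {
    awk '$1 == "Service" && $2 != "up" { print $NF }'
}

# Possible use while the package is starting (assumes cmviewcl is on PATH
# and the package really is named "package", as in this thread):
#   while :; do date; cmviewcl -vp package | services_not_up; sleep 1; done
```

If a service name is printed just before the node halts, that service is the first place to look; if nothing is printed, the failure is probably not visible at this level.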

Clive Nicholas · ‎08-31-2008

# swlist | grep -i serviceguard reports the following:
PHSS_17581 1.0 MC ServiceGuard 11.05 Cumulative Patch

# what /usr/lbin/cmcld reports the following:
HP92453-02A.10.20 HP_UX SYMBOLIC DEBUGGER (END.0) $Revision 7403 $
Build Date: Wed Mar 3 14:17:00 PST 1999
Build id: ibld_sg_a1105_patch
A.100.05 Date 99/02/22 PHSS_17581 (SG English/Japanese) PHSS_17483(LM English) PHSS_17484 (LM Japanese) Date: 99/01/13 PHSS_17230
Daemon
Config DB
Cluster Monitor
Command Srv
CommunicationSrv
Config
Dlm
Local Comm
Network Sensor
Package Manager
Remote Comm
API
Service Sensor
Cluster LVM
Status DB
Sync
Util
A.01.01 Resource Monitor API (11_00_AR: Oct 17 1997 09:24:32)

Steven E. Protter · ‎08-31-2008

Shalom,

All software within this installation is out of date and out of support.

That being said, the package seems to be shut down due to concerns about data integrity on shared storage.

You need to check all lock disks and shared disks configured within packages and configurations for trouble.

Start with dmesg, then perhaps exercise the disks with mstm/cstm/xstm (your choice) to find the root of the problem.

It would not hurt to plan to bring this cluster back into the world of supported OS/system software and perhaps gain help from your HP Software service contract.

SEP

Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com

Clive Nicholas · ‎08-31-2008

Yes, I believe the disk that contained the cluster lock was changed. (Note that the servers aren't actually at my station; I am helping out a colleague with these servers by logging in remotely over the network to the servers in the Netherlands.)
However, I had wondered about the cluster lock before, and I think we performed the correct procedure to ensure that the cluster lock is recreated correctly.
What I did was this:
cmhaltcl
vgchange -c n vg04
vgchange -c n vg05
vgchange -a y vg04
vgchange -a y vg05
cd /etc/cmcluster

I then ran the cmapplyconf command to recompile and distribute the package, before deactivating the volume groups and running the cluster again.

So, I believe the clusterlocks are OK, but is there a way to check this for sure?

Note: I have remotely logged into kwamc0s with a separate console and I used the command
# tail -f /var/adm/syslog/syslog.log
to follow events in this log when the package is started on kwamc0s.
I see the line
cmcld: Service PKG*3841 terminated due to an exit(0)

However, this line appears immediately BEFORE the line shown below with the same timestamp which says:
cmcld: Started package package on node kwamc0s
I've no idea why the package is failing and rebooting the node.
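
The ordering of those two syslog lines is easier to judge with all of the cmcld messages pulled out of the log together. The following is a minimal sketch, assuming the log path used above; the helper name cmcld_events is hypothetical and is not a Serviceguard tool.

```shell
#!/bin/sh
# Hypothetical helper: print only the cmcld lines from the given syslog file.
# grep -n prefixes each match with its line number in the file, so the
# relative order of events stays explicit even when timestamps are identical.
cmcld_events() {
    grep -n 'cmcld:' "$1"
}

# Possible use right after a failed package start:
#   cmcld_events /var/adm/syslog/syslog.log | tail -20
```

Comparing the numbered lines from both nodes should show whether the same service message precedes the halt each time.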

Clive Nicholas · ‎08-31-2008

Thanks for your help, Steven. I will try what you suggest to check the shared disks.
I think the problem with support is that the applications being run use HP OpenView DM TMN, and support for this has been discontinued by HP. The servers help supervise an older submarine cable system and there is no means to update the application software, so the OS has not been touched for many years and we would not be permitted to upgrade it (by managers on high!). These types of systems are installed and commissioned, and after the initial commissioning/proving period we are not permitted to upgrade or patch them further unless there is an exceptional requirement and proof that any work will not affect the current custom-built application software.

Deepak Kr · ‎08-31-2008

Clive, try the following regarding the cluster lock disk:

Run
# cmapplyconf -v -C /etc/cmcluster/cluster-configfilename

and then start the package.

I guess this could be due to missing lock info on the newly replaced disk.

or

If a previous backup of the lock VG configuration exists, then:

# vgcfgrestore -n /dev/lock-vg-name disk-device-name

For example:

# vgcfgrestore -n /dev/vg_lock /dev/dsk/c4t6d0

Yes, it is absolutely true that the Serviceguard version you are running is no longer supported by HP.

"There is always some scope for improvement"
