Problem stopping packages/cluster on one node cluster

Co van Berkel · ‎03-09-2006

Hi,
Help...
We use a one node cluster for our disaster environment.
Configuring and starting the cluster works fine.

When we stop one (or all) of the packages, the following happens:
- The applications in the package are shut down;
- The filesystems are unmounted;
- The package is shut down, as stated in the package log:
########### Node "": Package halt completed at Fri Mar 10 09:58:22 MET 2006 ###########

a. But the command (cmhaltpkg ) keeps waiting and never returns.
b. cmviewcl reports the package in status "halting", and it stays in that status:
:/ 1# cmviewcl

CLUSTER STATUS

up

NODE STATUS STATE
up running

PACKAGE STATUS STATE AUTO_RUN NODE

halting halting disabled

halting halting disabled

c. There are processes created.
It looks like these were the package processes?

d. We have to kill all "cm" processes to stop the cluster.

Can someone tell me what is wrong here?

Regards,
CvB

RAC_1 · ‎03-09-2006

By any chance, is package switching enabled? When the package is running, what does
cmviewcl -p "pkg_name" -v say?

There is no substitute to HARDWORK

Co van Berkel · ‎03-09-2006

Hi,
Yes, that is so.
Is this the problem?
Regards,
CvB

Carsten Krege · ‎03-09-2006

When a package is halted, SG starts a shell script through cmsrvassistd, which waits for the shell script to finish. If the script does not finish, cmsrvassistd does not regard the package as halted. You should put a "set -x" in the package scripts to see what they are actually doing during the halt, and check the process list, perhaps using the command:

# UNIX95=1 ps -efH

to see what the child processes of the package halt scripts are.
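
A minimal sketch of how such a process list could be narrowed down to the descendants of one process, for example a hanging ctrl.sh or cmsrvassistd itself. The function name `descendants` and the three-column input layout (PID, parent PID, command line) are assumptions made for this illustration; they are not part of Serviceguard.

```shell
#!/bin/sh
# descendants ROOT_PID
# Reads lines of the form "PID PPID COMMAND..." on stdin and prints every
# line whose process has ROOT_PID somewhere in its chain of parents.
descendants() {
    awk -v root="$1" '
        { parent[$1] = $2; pid[NR] = $1; line[NR] = $0 }
        END {
            for (i = 1; i <= NR; i++) {
                q = parent[pid[i]]
                hops = 0
                # Walk up the parent chain until we reach ROOT_PID, run out
                # of known parents, or hit the loop guard.
                while (q + 0 > 0 && q + 0 != root + 0 && hops < 1000) {
                    q = parent[q]
                    hops++
                }
                if (q + 0 == root + 0) print line[i]
            }
        }
    '
}
```

On HP-UX the input could come from something like `UNIX95=1 ps -e -o pid= -o ppid= -o args= | descendants "$pid"`, assuming the XPG4 `-o` option behaves there as specified; `$pid` stands for whichever process you want to inspect.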

You should consider setting HALT_SCRIPT_TIMEOUT in the package configuration file and rerunning cmapplyconf (see the SG manual). This would only be a workaround, though (and not a really good one either).

Carsten

-------------------------------------------------------------------------------------------------
In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move. -- HhGttG

Co van Berkel · ‎03-09-2006

Hi,
What configuration changes do I have to make, and where, to change packages from "switching enabled" to "switching disabled"?
Regards,
CvB

RAC_1 · ‎03-09-2006

On a single-node cluster, what is the use of package switching? Try cmmodpkg -d -v "pkg_name".

Also try putting "set -x" in the shutdown scripts to see where it hangs.

Co van Berkel · ‎03-09-2006

Hi,
In all the package log files, the last line says that the package is halted!
########### Node "": Package halt completed at Fri Mar 10 09:58:22 MET 2006 ###########

So, it looks to me that the ctrl.sh script is ended!
Also all filesystem are umounted!

What makes the problem more confusing is that halting packages works now, but after the cluster has been running for some days it is no longer possible to halt the packages.

Help...
Regards,
CvB

Co van Berkel · ‎03-09-2006

Hi,
I changed the "AUTO_RUN" parameter of all packages to "NO".

I tested it and now it seems to work. But I also tested "AUTO_RUN YES", and this also worked now?

How can I change "SWITCHING" from "enabled" to "disabled"?
(see "Node_Switching_Parameters" below)

# cmviewcl -v
CLUSTER STATUS

up

NODE STATUS STATE
up running

Network_Parameters:
INTERFACE STATUS PATH NAME
PRIMARY up 0/3/0/0 lan1
STANDBY up 0/6/0/0 lan2

PACKAGE STATUS STATE AUTO_RUN NODE

up running disabled

Policy_Parameters:
POLICY_NAME CONFIGURED_VALUE
Failover configured_node
Failback manual

Script_Parameters:
ITEM STATUS MAX_RESTARTS RESTARTS NAME
Subnet up 10.164.0.0

Node_Switching_Parameters:
NODE_TYPE STATUS SWITCHING NAME
Primary up enabled (current)

PACKAGE STATUS STATE AUTO_RUN NODE

up running disabled

Policy_Parameters:
POLICY_NAME CONFIGURED_VALUE
Failover configured_node
Failback manual

Script_Parameters:
ITEM STATUS MAX_RESTARTS RESTARTS NAME
Subnet up 10.164.0.0

Node_Switching_Parameters:
NODE_TYPE STATUS SWITCHING NAME
Primary up enabled (current)

Regards,
CvB

Carsten Krege · ‎03-10-2006

> So, it looks to me that the ctrl.sh
> script is ended!
> Also all filesystem are umounted!

This is not a good enough criterion to decide whether the script has finished. cmsrvassistd waits for the shell script to finish (see the wait(2) manual page). If the script has not finished, cmsrvassistd will not register the halt as complete. When the problem reoccurs, you should check the process list.

It is also possible that cmsrvassistd is unable to communicate with cmcld and update it on the new status. This is probably less likely.

Carsten

Co van Berkel · ‎03-10-2006

Hi,
The last commands executed in the package ctrl.sh script are:
- print "\n\t########### Node \"$(hostname)\": Package halt completed at $(date) ###########"
and
- exit 0

So, for me the script has ended...

Maybe cmsrvassistd is unable to communicate with cmcld.

In syslog.log some messages are missing when it goes wrong.
Normally it should look like this:
Mar 10 11:28:19 cmcld: Request from node to halt package .
Mar 10 11:28:19 cmcld: Executing '/etc/cmcluster//ctrl.sh stop' for package , as service PKG*4609.
Mar 10 11:28:28 cmcld: Processing exit status for service PKG*4609
Mar 10 11:28:28 cmcld: Service PKG*4609 terminated due to an exit(0).
Mar 10 11:28:28 cmcld: Halted package on node .

But when it goes wrong it looks like this:
Mar 10 11:28:19 cmcld: Request from node to halt package .
Mar 10 11:28:19 cmcld: Executing '/etc/cmcluster//ctrl.sh stop' for package , as service PKG*4609.

The last three lines are then NOT logged!
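
This comparison can be automated. The following is a minimal sketch, assuming the cmcld message formats shown in this post ("... as service PKG*nnnn." when a control script is started, "Service PKG*nnnn terminated ..." when its exit status is processed). The function name `unfinished_services` is made up for this example.

```shell
#!/bin/sh
# unfinished_services SYSLOG_FILE
# Prints one line per service ID (PKG*nnnn) that cmcld started but for
# which no later "Service PKG*nnnn terminated" message was logged.
unfinished_services() {
    awk '
        # Control script started: remember the service ID and timestamp.
        /as service PKG\*[0-9]+/ {
            match($0, /PKG\*[0-9]+/)
            id = substr($0, RSTART, RLENGTH)
            pending[id] = $1 " " $2 " " $3
        }
        # Exit status processed: the service is no longer pending.
        /Service PKG\*[0-9]+ terminated/ {
            match($0, /PKG\*[0-9]+/)
            delete pending[substr($0, RSTART, RLENGTH)]
        }
        END {
            for (id in pending) print id, "started", pending[id]
        }
    ' "$1"
}
```

It could be run as `unfinished_services /var/adm/syslog/syslog.log`. Any service it lists is one whose start or halt script cmcld is still waiting on, or whose exit status never reached cmcld. The order of the output lines is not defined.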

Regards,
CvB

Stephen Doud · ‎03-13-2006

So if Serviceguard is not running properly (i.e. package halt completion is not being registered with cmcld after some time has passed), the next thing that comes to mind is: what version and patch level of SG is installed?
Use "what /usr/lbin/cmcld | grep Date" to get that info.

What has changed recently, or has it always failed in this manner?

Does only one package do this, or multiple?
If only one package runs normally, create a second package that does nothing (do not modify the package control script), start it up, and let it run for a while too before halting it and checking whether SG registers the package halt completion.

Co van Berkel · ‎03-13-2006

Hi,
1. Version info of "/usr/lbin/cmcld":
A.11.12 Date: 11/10/2000; PATCH: PHSS_22541

2. It has always failed in this manner;

3. All four packages fail in this manner;

Regards,
CvB

Carsten Krege · ‎03-13-2006

SG A.11.12 is an old version. Support for this version ended on Oct 31, 2003. You should really update to something newer.

Because the issue can be reproduced it might make sense to perform a tusc trace of cmsrvassistd before the next cmhaltpkg. tusc can be obtained from http://gatekeep.cs.utah.edu and you should run it with

# tusc -f -p -E -v -T%X -r all -w all -o /tmp/cmsrv.trc

This would show you whether cmsrvassistd keeps waiting (i.e. does not get signaled) or tries to send a message to cmcld.

Not sure this is worth it. You might also consider updating your patch level.

Carsten

Co van Berkel · ‎03-13-2006

Hi Carsten,
I tried tusc.
The cluster went DOWN (aborted) the moment I started tusc and ran "cmcheckconf".
All the filesystems were still mounted and the applications were still running.
No cmcld or other MC/SG processes were running any more.
I had to stop all applications and unmount all filesystems by hand.
Because "netstat -in" also showed all package IP addresses still assigned to the LAN card, I rebooted the server to be able to start the cluster!

Regards,
CvB

Stephen Doud · ‎03-13-2006

I have been supporting Serviceguard for many years and haven't seen the particular trouble you are seeing. So my suspicion is that something is amiss in the Serviceguard or system files on the server(s). After you rebooted, did you attempt tusc again? Results?

If you can't afford to run tusc, follow Carsten's advice and consider upgrading Serviceguard. This will give you current files with a current SG patch, replacing potentially corrupt or sick bits.

Carsten Krege · ‎03-14-2006

The issues you are posting are getting more and more weird. I don't see a relationship between tusc tracing of cmsrvassistd and running cmcheckconf (although I recommended cmhaltpkg...). I verified that the tusc trace would give reasonable output.

If you have gdb (product WDB) installed you might want to check out the core file of cmcld to get a stack trace:

# gdb /usr/lbin/cmcld /var/adm/cmcluster/core
gdb> bt

I suspect that cmsrvassistd died for some reason and could not be restarted by cmcld. What does syslog say?

Do you by any chance have PHNE_28895, the cumulative ARPA Transport patch, installed? This patch is known to remove the route to the loopback network (127.0.0.0), with the result that cmsrvassistd can no longer talk to cmcld and cmcld cannot restart cmsrvassistd when it dies. It is not a perfect match for what you have seen so far, which is why I did not mention it earlier. But perhaps you could double-check.

Carsten

Co van Berkel · ‎03-14-2006

Hi,

In the output file of the tusc command the following text is displayed:
<27>Mar 14 11:51:19 cmsrvassistd[7996]: Lost connection to the cluster daemon.

<27>Mar 14 11:51:19 cmsrvassistd[7996]: Lost connection with ServiceGuard cluster daemon (cmcld): Software caused connection abort

One question here.
What if I run "cmdeleteconf" after shutting down the cluster, and then run cmcheckconf and cmapplyconf?
Will this clear all "old" configuration files and rebuild the configuration from scratch?

Regards,
CvB

Carsten Krege · ‎03-14-2006

A cmdeleteconf would basically remove the /etc/cmcluster/cmclconfig binary and you would need to rebuild SG from scratch using cmcheckconf/cmapplyconf. I don't think this would help for any of your problems though. I'd say it is more reasonable to use a later version of SG soon.

The tusc trace you show is what I would expect if cmcld dies. cmcld and cmsrvassistd have a TCP connection open. When this goes down the daemons will notice this and cmcld would try to re-establish it (of course if it is just the death of the TCP connection and cmcld is still alive).

Well, you already said that cmcld died. The question is why? Because of the tusc trace?? Only a stack trace of cmcld and/or famous last words of cmcld from syslog would tell us. If you do not have gdb you can try using adb

# adb /usr/lbin/cmcld /var/adm/cmcluster/core
adb> $c

This does not always work though. gdb is better. I think this would be interesting because I could imagine that the problems reported are related (cmcld death, packages fail to halt).

Carsten

Co van Berkel · ‎03-14-2006

Hi,

Here is the syslog.log output from the moment cmcld died:
Mar 14 11:51:16 cmcld: Assertion failed: (tsb_tmp).tsb_low <= TICKS_PER_MAX_USEC, file: timers.c, line: 792
Mar 14 11:51:19 cmlvmd: Could not read messages from /usr/lbin/cmcld: Software caused connection abort
Mar 14 11:51:19 cmsrvassistd[7996]: Lost connection with ServiceGuard cluster daemon (cmcld): Software caused connection abort

Also the cmcld messages after startup in the syslog.log:
Mar 14 12:31:37 cmclconfd[1794]: Executing "/usr/lbin/cmcld" for node
Mar 14 12:31:38 cmcld: Daemon Initialization - Maximum number of packages supported for this incarnation is 5.
Mar 14 12:31:38 cmcld: Reserving 1748 Kbytes of memory and 49 threads
Mar 14 12:31:38 cmcld: The maximum # of concurrent local connections to the daemon that will be supported is 19.
Mar 14 12:31:38 cmcld: Warning. No cluster lock is configured.
Mar 14 12:31:39 cmcld: cmgmsd_init: SG CLUSTER ; return 1
Mar 14 12:31:39 cmcld: Starting cluster management protocols.
Mar 14 12:31:39 cmcld: Attempting to form a new cluster
Mar 14 12:31:39 cmcld: Turning off safety time protection since the cluster
Mar 14 12:31:39 cmcld: now consists of a single node. If ServiceGuard
Mar 14 12:31:39 cmcld: fails, this node will not automatically halt
Mar 14 12:31:39 cmcld: 1 nodes have formed a new cluster, sequence #1
Mar 14 12:31:39 cmcld: The new active cluster membership is: (id=1)
Mar 14 12:32:24 cmcld: Request from node to start package on node .
Mar 14 12:32:24 cmcld: Executing '/etc/cmcluster//ctrl.sh start' for package , as service PKG*15876.
Mar 14 12:32:39 cmcld: Processing exit status for service PKG*15876
Mar 14 12:32:39 cmcld: Service PKG*15876 terminated due to an exit(0).
Mar 14 12:32:39 cmcld: Started package on node .
Mar 14 12:32:53 cmcld: Request from node to start package on node .
Mar 14 12:32:53 cmcld: Executing '/etc/cmcluster//ctrl.sh start' for package , as service PKG*4609.
Mar 14 12:32:59 cmcld: Processing exit status for service PKG*4609
Mar 14 12:32:59 cmcld: Service PKG*4609 terminated due to an exit(0).
Mar 14 12:32:59 cmcld: Started package on node .
Mar 14 12:33:06 cmcld: Request from node to start package on node .
Mar 14 12:33:06 cmcld: Executing '/etc/cmcluster//ctrl.sh start' for package , as service PKG*4610.
Mar 14 12:33:19 cmcld: Processing exit status for service PKG*4610
Mar 14 12:33:19 cmcld: Service PKG*4610 terminated due to an exit(0).
Mar 14 12:33:19 cmcld: Started package on node .
Mar 14 12:33:29 cmcld: Request from node to start package on node .
Mar 14 12:33:29 cmcld: Executing '/etc/cmcluster//ctrl.sh start' for package , as service PKG*4611.
Mar 14 12:35:39 cmcld: Processing exit status for service PKG*4611
Mar 14 12:35:39 cmcld: Service PKG*4611 terminated due to an exit(0).
Mar 14 12:35:39 cmcld: Started package on node .

Regards,
CvB

Co van Berkel · ‎03-14-2006

Hi Carsten,

No patch PHNE_28895 isn't installed.
We run on a HP-UX 11.00 server.

Regards,
CvB

Carsten Krege · ‎03-14-2006

> No patch PHNE_28895 isn't installed.
> We run on a HP-UX 11.00 server.

This makes perfect sense, because SG A.11.12 was only supported on 11.00 ... ok.

> Mar 14 11:51:16 cmcld: Assertion failed: (tsb_tmp).tsb_low <= TICKS_PER_MAX_USEC, file: timers.c, line: 792

This is the key message. An assertion has failed (i.e. a specific condition in the code that was expected to be true was actually false, which caused cmcld to abort).

I think this might be fixed in PHSS_23373 for SG A.11.12. Search the patch text for "TICKS_PER_MAX_USEC". However, this message can also be caused by a system hang that starved cmcld of CPU. Judging from the fact that you only run a 1-node cluster, this might be even more likely.
In no way would I expect the tusc trace to be responsible for the cmcld abort.

To fix the most prominent system hangs it is advisable to call your support rep and to ask him to prepare a patch bundle that contains kernel patches that fix kernel hangs (LVM, Filesystem, SCSI, FC, ARPA, LAN, process management.. )

A system hang could potentially also explain the package halt problems. If cmsrvassistd is starved out it won't be able to deliver the return value of the package halt scripts back to cmcld.

Carsten
