HELP!! Weird Serviceguard behaviours...

psyduck™ · ‎10-01-2007

[Environment]
- rp5470 2 node cluster (Active/Standby)
- HP-UX 11.11 (2003)
- NO OPS/RAC
- NO APA
- Serviceguard 11.12 (w/ 2 or 3 patches)
- EMC CX300 (shared disk)
- configured package = 1 (pkg1)
- 1 data/heartbeat, 1 heartbeat, 1 standby LANs

[Symptoms]
- After failing pkg1 over to node 2, node 1 can NOT rejoin the cluster.
- After rebooting both nodes, they can NOT form the cluster.
- Error messages and other Serviceguard-related messages are logged only in the syslog.log file.
(The last entry in the pkg1.log file is from March 2007 on both nodes.)
- cmcheckconf and cmapplyconf report NO errors or warnings, but the cluster still fails to form.
- cmrunnode and cmruncl produce the same log messages, as follows:

Sep 17 15:17:26 hp1 CM-CMD[4576]: cmmodpkg -e pkg1
Sep 17 15:17:53 hp1 CM-CMD[4579]: cmrunnode -v hp1
Sep 17 15:17:53 hp1 cmclconfd[4582]: Request from root on node hp1 to start the cluster on this node.
Sep 17 15:17:53 hp1 cmclconfd[4582]: Executing "/usr/lbin/cmcld" for node hp1
Sep 17 15:17:53 hp1 cmcld: Logging level changed to level 0.
Sep 17 15:17:53 hp1 cmcld: Daemon Initialization - Maximum number of packages supported for this incarnation is 2.
Sep 17 15:17:53 hp1 cmcld: Global Cluster Information:
Sep 17 15:17:53 hp1 cmcld: Heartbeat Interval is 1 seconds.
Sep 17 15:17:53 hp1 cmcld: Node Timeout is 6 seconds.
Sep 17 15:17:53 hp1 cmcld: Network Polling Interval is 2 seconds.
Sep 17 15:17:53 hp1 cmcld: Logging level changed to level 0.
Sep 17 15:17:53 hp1 cmcld: Auto Start Timeout is 600 seconds.
Sep 17 15:17:53 hp1 cmcld: Information Specific to node hp1:
Sep 17 15:17:53 hp1 cmcld: Cluster lock disk: /dev/dsk/c6t0d2.
Sep 17 15:17:53 hp1 cmcld: lan0 0x00306e48960c 192.168.1.81 bridged net:1
Sep 17 15:17:53 hp1 cmcld: lan2 0x001083f6ff93 10.0.0.1 bridged net:2
Sep 17 15:17:53 hp1 cmcld: lan1 0x001083fbeab3 standby bridged net:1
Sep 17 15:17:53 hp1 cmcld: Heartbeat Subnet: 192.168.1.0
Sep 17 15:17:53 hp1 cmcld: Heartbeat Subnet: 10.0.0.0
Sep 17 15:17:53 hp1 cmcld: The maximum # of concurrent local connections to the daemon that will be supported is 2018.
Sep 17 15:17:53 hp1 cmcld: Service cmlogd terminated due to an exit(118).
Sep 17 15:17:53 hp1 cmcld: Automatically restarted service cmlogd for the 1st time after failure.
Sep 17 15:17:53 hp1 cmcld: Service cmlvmd terminated due to an exit(118).
Sep 17 15:17:53 hp1 cmcld: Automatically restarted service cmlogd for the 2nd time after failure.
.....

Sep 17 15:18:53 hp1 cmcld: Automatically restarted service cmlogd for the 857th time after failure.
Sep 17 15:18:53 hp1 cmcld: Timedout waiting for LVM daemon
Sep 17 15:18:53 hp1 cmcld: Daemon exiting to preserve data integrity
Sep 17 15:18:53 hp1 cmcld: Reason: LVM daemon did not start
Sep 17 15:18:53 hp1 cmsrvassistd[4586]: The cluster daemon aborted our connection.
Sep 17 15:18:53 hp1 cmcld: Service cmlogd terminated due to an exit(118).
Sep 17 15:18:53 hp1 above message repeats 856 times
Sep 17 15:18:53 hp1 cmsrvassistd[4586]: Lost connection with ServiceGuard cluster daemon (cmcld): Software caused connection abort
Sep 17 15:18:53 hp1 cmclconfd[5405]: The cluster daemon aborted our connection.

--------------------------------------------------------

Now the system is running in non-HA mode.
How can I solve this problem...?
Any help will be appreciated..

Thank you

melvyn burnard · ‎10-01-2007

Firstly, SG A.11.12 has been out of support for almost four years.
Secondly, SG A.11.12 was NOT supported on HP-UX 11.11.
Thirdly, the EMC CX series are unsupported in Serviceguard, although they are known to work.
Fourth item, the CX series are completely unsupported as a Cluster Lock disc.
And last point, are you SURE you have SG A.11.12? Your syslog says differently.

Run this on both nodes:
what /usr/lbin/cmcld

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!

Matti_Kurkela · ‎10-01-2007

These messages indicate that ServiceGuard daemons are not starting up properly:

Sep 17 15:18:53 hp1 cmcld: Automatically restarted service cmlogd for the 857th time after failure.
Sep 17 15:18:53 hp1 cmcld: Timedout waiting for LVM daemon
Sep 17 15:18:53 hp1 cmcld: Daemon exiting to preserve data integrity
Sep 17 15:18:53 hp1 cmcld: Reason: LVM daemon did not start

Are there any indications of hardware problems? Check "dmesg" listing for anything unusual, like SCSI errors.
What was the reason for the original failover?

ServiceGuard 11.12 was for HP-UX 11.00 only, as shown in the release notes:

http://docs.hp.com/en/B3935-90063/ch01s04.html#d0e86

You're running a software configuration that is not supported now, and apparently was NEVER supported.
The question is not so much "why it stopped working now?" but "how on earth did it work at all up to this point?"

ServiceGuard is very much OS version specific: when you upgrade the OS version, you should check the ServiceGuard version requirements too, and upgrade ServiceGuard when necessary.

The ServiceGuard documentation has specific instructions about how to perform OS and/or ServiceGuard upgrades while minimizing downtime. I guess the system might have been upgraded from 11.00 to 11.11, but ServiceGuard was not upgraded to match the OS version.

However, these instructions may not help you now, because they assume you're starting from a fully-functional configuration. You should get a version of ServiceGuard that is supported with your OS version (ServiceGuard 11.16), and use that to re-create your cluster.

You should use cmgetconf on your current set-up to get up-to-date versions of both cluster and package configuration files in ASCII form. You should also get a copy of the current package control scripts: you should re-create them using the control script template of your new ServiceGuard version.

As you obviously have some downtime ahead, you should get up to date with OS patches at the same time. At least install the most recent Quality Pack (June 2007).

MK

Steven E. Protter · ‎10-01-2007

Shalom,

I know Melvyn will ask for a link; however, some EMC CX models are now supported by SG, though not on an SG release that old.

My guess as to your actual problem is that the SG configuration files are not consistent on both nodes.

Check the size, date and time stamp, especially on the binary files that cmquerycl/cmcheckconf/cmapplyconf produce.
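
The comparison suggested above can be scripted. What follows is only a minimal Python sketch, assuming a Python 3 interpreter is available on both nodes (not a given on HP-UX 11.11). `fingerprint` is a hypothetical helper name, and the files to check are passed on the command line rather than assumed. It prints size, modification time and a SHA-256 checksum per file, so the output from both nodes can be compared line by line.

```python
import hashlib
import os
import sys


def fingerprint(path):
    """Return (size_in_bytes, mtime_seconds, sha256_hex) for one file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    info = os.stat(path)
    return info.st_size, int(info.st_mtime), digest.hexdigest()


if __name__ == "__main__":
    # Paths are supplied by the caller; none are hard-coded here.
    for path in sys.argv[1:]:
        size, mtime, checksum = fingerprint(path)
        print(f"{path}\t{size}\t{mtime}\t{checksum}")
```

Modification times can differ between nodes even when the content is identical, so the checksum is probably the more telling column.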

SEP

Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com

melvyn burnard · ‎10-01-2007

If the EMC CX300 array is a Clariion array, it is NOT supported for Serviceguard from a Serviceguard perspective.
There are links indicating that they are, but the Division has not tested these devices.
There is now a customer viewable version of this information available externally at: http://www.hp.com/products1/unixserverconnectivity/mass_storage_devices.html.

There you will find a matrix of third party mass storage devices supported on HP-UX, which also indicates whether or not the device is supported for Serviceguard. Please note that the mass storage devices listed as supported with Serviceguard on this web page are not fully-supported in that HP Serviceguard Labs have not tested them and cannot guarantee that they meet all of our HA requirements.

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!

psyduck™ · ‎10-01-2007

Sorry, my mistake.
Serviceguard is 11.15
and the shared disk is a CX400.

The reason for the package failover:
In mid-September a cron command failed with an rc=1 message in cron.log (before that day cron worked fine), so the system admin failed the package over to node 2 and ran the same cron job there.
After the test, he wanted to move the package back, and that is when the problem appeared...

------------

The cluster configuration file is the same on both nodes.
I recompiled the cluster binary files...

Thank you

F Verschuren · ‎10-02-2007

Sounds like the cluster is a little bit out of shape. I have seen several lookalike problems; the solutions can be:

1 Add a "+ +" to /.rhosts and see if the problem was caused by authorisation (do not forget to replace it with more secure rules afterwards!)
2 There is a routing problem: clean out the routes that make traffic go over the wrong VLANs.
3 Make sure all IP addresses are in /etc/hosts.
4 Check if cmgetconf and cmcheckconf still work...
5 The cluster config file is corrupt:
move the config file, reboot the server, recreate the cluster.

F Verschuren · ‎10-02-2007

6 Run a cmrunnode -f <system name> from the other node....

John Bigg · ‎10-02-2007

Both cmlvmd and cmlogd are exiting with a status of 118. This exit status indicates that the daemon failed. Given the other messages I think the most likely cause is that the daemons failed to make a connection back to cmcld.
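
To see how these exits are distributed across the helper daemons, the matching syslog lines can be tallied. This is a minimal Python sketch, assuming Python 3 and the message wording exactly as quoted in this thread. `count_service_exits` and the two patterns are illustrative names, not part of Serviceguard. It also folds in syslog's "above message repeats N times" lines.

```python
import re
import sys
from collections import Counter

# Wording taken from the syslog excerpt quoted in this thread.
EXIT_PATTERN = re.compile(
    r"cmcld: Service (\S+) terminated due to an exit\((\d+)\)"
)
REPEAT_PATTERN = re.compile(r"above message repeats (\d+) times")


def count_service_exits(lines):
    """Count (service, exit_status) pairs, honouring syslog repeat lines."""
    counts = Counter()
    last_key = None
    for line in lines:
        match = EXIT_PATTERN.search(line)
        if match:
            last_key = (match.group(1), int(match.group(2)))
            counts[last_key] += 1
            continue
        repeat = REPEAT_PATTERN.search(line)
        if repeat and last_key is not None:
            # A repeat line refers to the message immediately before it.
            counts[last_key] += int(repeat.group(1))
        last_key = None
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log path is supplied by the caller, not assumed.
    with open(sys.argv[1], errors="replace") as log:
        for (service, status), total in count_service_exits(log).items():
            print(f"{service}\texit({status})\t{total}")
```

A tally like this only summarises what syslog already recorded; it does not explain why the daemons exit.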

When Serviceguard starts, the main daemon cmcld is started by cmclconfd when cmrunnode is executed. cmcld then starts a number of helper processes such as cmlvmd and cmlogd. These daemons, once started, make a connection back to cmcld over localhost. It is this connection which is failing.

The exact cause of the failure is going to be hard to determine on 11.15 since this does not allow logging to be turned up on these daemons. Note this release is unsupported and has been since last year. If you were running 11.17 or 11.18 you could have simply turned up logging on cmlvmd to see exactly what was going wrong. You should probably upgrade to one of these releases anyway since these are the only ones currently being actively patched.

Sorry, not sure I can suggest what you can check other than making sure the usual things such as /etc/services, permissions, networking (especially localhost communications) are correct. Can you telnet to localhost?

If this were me I would replace cmlvmd with a wrapper script which ran a tusc of cmlvmd to see if there are any particular system calls which are failing.

If you cannot solve this with this info I suggest you contact HP support.

psyduck™ · ‎10-04-2007

Thanks for your responses...

I checked network connectivity on the following ports.
(# telnet localhost <port>)

hacl-hb 5300/tcp # High Availability (HA) Cluster heartbeat
hacl-gs 5301/tcp # HA Cluster General Services
hacl-cfg 5302/tcp # HA Cluster TCP configuration
hacl-cfg 5302/udp # HA Cluster UDP configuration
hacl-probe 5303/tcp # HA Cluster TCP probe
hacl-probe 5303/udp # HA Cluster UDP probe
hacl-local 5304/tcp # HA Cluster Commands
hacl-test 5305/tcp # HA Cluster Test
hacl-dlm 5408/tcp # HA Cluster distributed lock manager

Of these ports, only 5302 and 5303 accept connections; the others cannot be reached (connection refused).
Is this a normal condition?

I checked the /etc/services and /etc/inetd.conf files but found no errors, and there are no special characters in them.
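
The manual telnet check above can be repeated in one pass. This is a minimal Python sketch, assuming Python 3 is available and that the entries use the /etc/services layout shown in this post. `parse_services`, `probe_tcp` and `report` are hypothetical helper names. Only TCP entries are probed, since a plain connect test says nothing about UDP.

```python
import socket


def parse_services(text):
    """Return (name, port, protocol) tuples from /etc/services-style text."""
    entries = []
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop trailing comments
        if len(fields) < 2 or "/" not in fields[1]:
            continue
        port, proto = fields[1].split("/", 1)
        if port.isdigit():
            entries.append((fields[0], int(port), proto))
    return entries


def probe_tcp(host, port, timeout=2.0):
    """Return 'open', 'refused', or a short error description."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        return f"error: {exc}"


def report(services_text, host="localhost", prefix="hacl-"):
    """Probe every TCP entry whose service name starts with the prefix."""
    lines = []
    for name, port, proto in parse_services(services_text):
        if proto == "tcp" and name.startswith(prefix):
            lines.append(f"{name:<12}{port}/tcp  {probe_tcp(host, port)}")
    return lines
```

It could be run as `print("\n".join(report(open("/etc/services").read())))`. Note that "refused" only means nothing was listening on that port at that moment; which daemon is expected to listen on each port is not stated in this thread.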

psyduck™ · ‎11-01-2007

OK..
We decided to reinstall the OS and SG.
The OS is behaving somewhat abnormally, and we can't figure out what the problem is...

Thanks for your help..
Thanks again.
