HELP!! Weird Serviceguard behaviours...
10-01-2007 07:24 PM
- rp5470 2 node cluster (Active/Standby)
- HP-UX 11.11 (2003)
- NO OPS/RAC
- NO APA
- Serviceguard 11.12 (with 2 or 3 patches)
- EMC CX300 (shared disk)
- 1 configured package (pkg1)
- LANs: 1 data/heartbeat, 1 heartbeat, 1 standby
[Symptoms]
- After failing pkg1 over to node 2, node 1 can NOT rejoin the cluster.
- After rebooting both nodes, they can NOT form the cluster.
- Error messages and other Serviceguard-related messages are logged only in the syslog.log file.
(The last entry in the pkg1.log file is from March 2007 on both nodes.)
- cmcheckconf and cmapplyconf report NO errors or warnings, but the cluster still fails to form.
- cmrunnode and cmruncl produce the same log messages, as follows:
Sep 17 15:17:26 hp1 CM-CMD[4576]: cmmodpkg -e pkg1
Sep 17 15:17:53 hp1 CM-CMD[4579]: cmrunnode -v hp1
Sep 17 15:17:53 hp1 cmclconfd[4582]: Request from root on node hp1 to start the cluster on this node.
Sep 17 15:17:53 hp1 cmclconfd[4582]: Executing "/usr/lbin/cmcld" for node hp1
Sep 17 15:17:53 hp1 cmcld: Logging level changed to level 0.
Sep 17 15:17:53 hp1 cmcld: Daemon Initialization - Maximum number of packages supported for this incarnation is 2.
Sep 17 15:17:53 hp1 cmcld: Global Cluster Information:
Sep 17 15:17:53 hp1 cmcld: Heartbeat Interval is 1 seconds.
Sep 17 15:17:53 hp1 cmcld: Node Timeout is 6 seconds.
Sep 17 15:17:53 hp1 cmcld: Network Polling Interval is 2 seconds.
Sep 17 15:17:53 hp1 cmcld: Logging level changed to level 0.
Sep 17 15:17:53 hp1 cmcld: Auto Start Timeout is 600 seconds.
Sep 17 15:17:53 hp1 cmcld: Information Specific to node hp1:
Sep 17 15:17:53 hp1 cmcld: Cluster lock disk: /dev/dsk/c6t0d2.
Sep 17 15:17:53 hp1 cmcld: lan0 0x00306e48960c 192.168.1.81 bridged net:1
Sep 17 15:17:53 hp1 cmcld: lan2 0x001083f6ff93 10.0.0.1 bridged net:2
Sep 17 15:17:53 hp1 cmcld: lan1 0x001083fbeab3 standby bridged net:1
Sep 17 15:17:53 hp1 cmcld: Heartbeat Subnet: 192.168.1.0
Sep 17 15:17:53 hp1 cmcld: Heartbeat Subnet: 10.0.0.0
Sep 17 15:17:53 hp1 cmcld: The maximum # of concurrent local connections to the daemon that will be supported is 2018.
Sep 17 15:17:53 hp1 cmcld: Service cmlogd terminated due to an exit(118).
Sep 17 15:17:53 hp1 cmcld: Automatically restarted service cmlogd for the 1st time after failure.
Sep 17 15:17:53 hp1 cmcld: Service cmlvmd terminated due to an exit(118).
Sep 17 15:17:53 hp1 cmcld: Automatically restarted service cmlogd for the 2nd time after failure.
.....
Sep 17 15:18:53 hp1 cmcld: Automatically restarted service cmlogd for the 857th time after failure.
Sep 17 15:18:53 hp1 cmcld: Timedout waiting for LVM daemon
Sep 17 15:18:53 hp1 cmcld: Daemon exiting to preserve data integrity
Sep 17 15:18:53 hp1 cmcld: Reason: LVM daemon did not start
Sep 17 15:18:53 hp1 cmsrvassistd[4586]: The cluster daemon aborted our connection.
Sep 17 15:18:53 hp1 cmcld: Service cmlogd terminated due to an exit(118).
Sep 17 15:18:53 hp1 above message repeats 856 times
Sep 17 15:18:53 hp1 cmsrvassistd[4586]: Lost connection with ServiceGuard cluster daemon (cmcld): Software caused connection abort
Sep 17 15:18:53 hp1 cmclconfd[5405]: The cluster daemon aborted our connection.
--------------------------------------------------------
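For a log like the one above, a short script can tally how often each helper service exited, which makes the restart loop easier to see. This is a minimal sketch, not part of Serviceguard: the function name `count_service_exits` and the regular expressions are assumptions based only on the message formats shown in this excerpt, and it assumes a Python interpreter is available wherever the log is analyzed.

```python
import re
from collections import Counter

# Matches cmcld lines such as:
#   "... cmcld: Service cmlogd terminated due to an exit(118)."
EXIT_RE = re.compile(r"cmcld: Service (\S+) terminated due to an exit\((\d+)\)")
# Matches syslog's compression line: "... above message repeats 856 times"
REPEAT_RE = re.compile(r"above message repeats (\d+) times")


def count_service_exits(lines):
    """Return a Counter keyed by (service, exit_code)."""
    counts = Counter()
    last_key = None  # set only when the previous line was an exit message
    for line in lines:
        m = EXIT_RE.search(line)
        if m:
            last_key = (m.group(1), int(m.group(2)))
            counts[last_key] += 1
            continue
        r = REPEAT_RE.search(line)
        if r and last_key is not None:
            # "above message repeats N times" refers to the previous line
            counts[last_key] += int(r.group(1))
        last_key = None
    return counts


# Sample lines copied from the excerpt above.
sample = [
    "Sep 17 15:17:53 hp1 cmcld: Service cmlogd terminated due to an exit(118).",
    "Sep 17 15:17:53 hp1 cmcld: Automatically restarted service cmlogd for the 1st time after failure.",
    "Sep 17 15:17:53 hp1 cmcld: Service cmlvmd terminated due to an exit(118).",
    "Sep 17 15:18:53 hp1 cmcld: Service cmlogd terminated due to an exit(118).",
    "Sep 17 15:18:53 hp1 above message repeats 856 times",
]

for (service, code), n in sorted(count_service_exits(sample).items()):
    print(service, code, n)
```

To run it against the real log, pass the lines of syslog.log (typically /var/adm/syslog/syslog.log on HP-UX) instead of the sample list.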
Now the system is running in non-HA mode.
How can I solve this problem...?
Any help will be appreciated..
Thank you
10-01-2007 08:20 PM
Re: HELP!! Weird Serviceguard behaviours...
Secondly, SG A.11.12 was NOT supported on HP-UX 11.11
Thirdly, the EMC CX series are unsupported in Serviceguard, although they are known to work.
Fourth item, the CX series are completely unsupported as a Cluster Lock disc.
And last point, are you SURE you have SG A.11.12? Your syslog says otherwise.
Run the following on both nodes:
what /usr/lbin/cmcld
10-01-2007 08:45 PM
Re: HELP!! Weird Serviceguard behaviours...
Sep 17 15:18:53 hp1 cmcld: Automatically restarted service cmlogd for the 857th time after failure.
Sep 17 15:18:53 hp1 cmcld: Timedout waiting for LVM daemon
Sep 17 15:18:53 hp1 cmcld: Daemon exiting to preserve data integrity
Sep 17 15:18:53 hp1 cmcld: Reason: LVM daemon did not start
Are there any indications of hardware problems? Check "dmesg" listing for anything unusual, like SCSI errors.
What was the reason for the original failover?
ServiceGuard 11.12 was for HP-UX 11.00 only, as shown in the release notes:
http://docs.hp.com/en/B3935-90063/ch01s04.html#d0e86
You're running a software configuration that is not supported now, and apparently was NEVER supported.
The question is not so much "why it stopped working now?" but "how on earth did it work at all up to this point?"
ServiceGuard is very much OS version specific: when you upgrade the OS version, you should check the ServiceGuard version requirements too, and upgrade ServiceGuard when necessary.
The ServiceGuard documentation has specific instructions about how to perform OS and/or ServiceGuard upgrades while minimizing downtime. I guess the system might have been upgraded from 11.00 to 11.11, but ServiceGuard was not upgraded to match the OS version.
However, these instructions may not help you now, because they assume you're starting from a fully functional configuration. You should get a version of ServiceGuard that is supported on your OS version (for HP-UX 11.11, ServiceGuard 11.16), and use that to re-create your cluster.
You should use cmgetconf on your current set-up to get up-to-date versions of both cluster and package configuration files in ASCII form. You should also get a copy of the current package control scripts: you should re-create them using the control script template of your new ServiceGuard version.
As you obviously have some downtime ahead, you should get up to date with OS patches at the same time. At least install the most recent Quality Pack (June 2007).
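The cmgetconf step could be scripted roughly as follows. This is a sketch under assumptions: the cluster name `cluster1` and the output directory `/var/tmp/sg_backup` are hypothetical, `pkg1` is the package named in the original post, and the `-c` and `-p` options should be checked against the cmgetconf man page for your ServiceGuard release before use. The script only builds and prints the command lines; it does not run them.

```python
import posixpath


def cmgetconf_commands(cluster, packages, outdir):
    """Build argv lists that would dump cluster and package configs as ASCII files."""
    # One command for the cluster configuration ...
    cmds = [["cmgetconf", "-c", cluster, posixpath.join(outdir, cluster + ".ascii")]]
    # ... and one per package configuration.
    for pkg in packages:
        cmds.append(["cmgetconf", "-p", pkg, posixpath.join(outdir, pkg + ".ascii")])
    return cmds


# "cluster1" and the output directory are placeholders, not values from this thread.
for argv in cmgetconf_commands("cluster1", ["pkg1"], "/var/tmp/sg_backup"):
    print(" ".join(argv))
```

Remember to copy the package control scripts separately, since cmgetconf only covers the configuration files.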
MK
10-01-2007 08:52 PM
Re: HELP!! Weird Serviceguard behaviours...
I know melvn will ask for a link; however, some EMC CX models are now supported by SG, though not on an SG release that old.
My guess as to your actual problem is that the SG configuration files are not consistent on both nodes.
Check the size, date, and time stamp, especially of the binary files that cmquerycl/cmcheckconf/cmapplyconf produce.
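One way to do that comparison is to fingerprint the file on each node and compare the output. A minimal sketch, assuming Python is available on both nodes; `fingerprint` is a hypothetical helper, and /etc/cmcluster/cmclconfig is assumed to be the location of the binary cluster configuration, so adjust the path if yours differs.

```python
import hashlib
import os


def fingerprint(path):
    """Return (size_in_bytes, mtime_seconds, sha256_hex) for one file."""
    st = os.stat(path)
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return st.st_size, int(st.st_mtime), digest.hexdigest()


# Example (run on each node, then compare the printed tuples):
#   print(fingerprint("/etc/cmcluster/cmclconfig"))  # path is an assumption
```

Matching sizes and digests mean the file contents are identical; the modification times can still differ if the file was distributed at different moments.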
SEP
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
10-01-2007 09:07 PM
Re: HELP!! Weird Serviceguard behaviours...
There are links indicating that they are, but the Division has not tested these devices.
There is now a customer viewable version of this information available externally at: http://www.hp.com/products1/unixserverconnectivity/mass_storage_devices.html.
There you will find a matrix of third party mass storage devices supported on HP-UX, which also indicates whether or not the device is supported for Serviceguard. Please note that the mass storage devices listed as supported with Serviceguard on this web page are not fully-supported in that HP Serviceguard Labs have not tested them and cannot guarantee that they meet all of our HA requirements.
10-01-2007 09:21 PM
Re: HELP!! Weird Serviceguard behaviours...
Serviceguard is 11.15, and the shared disk is a CX400.
The reason for the package failover:
A cron command failed with an rc=1 message (in cron.log) in mid-September (before that day cron worked fine), so the system admin failed the package over to node 2 and ran the same cron job there.
After the test, he wanted to move the package back, and that is when the problem appeared...
------------
The cluster configuration file is the same on both nodes.
I recompiled the cluster binary files...
Thank you
10-02-2007 06:40 PM
Re: HELP!! Weird Serviceguard behaviours...
1 Add a "+ +" entry to /.rhosts and see whether the problem is caused by authorization (do not forget to replace it with more secure rules afterwards!)
2 If there is a routing problem, clean out the routes that send traffic over the wrong VLANs.
3 Make sure all IP addresses are in /etc/hosts.
4 Check whether cmgetconf and cmcheckconf still work...
5 If the cluster config file is corrupt:
move the config file aside, reboot the server, and recreate the cluster.
10-02-2007 06:43 PM
Re: HELP!! Weird Serviceguard behaviours...
10-02-2007 08:48 PM
Solution
When Serviceguard starts, the main daemon cmcld is started by cmclconfd when cmrunnode is executed. cmcld then starts a number of helper processes such as cmlvmd and cmlogd. Once started, these daemons make a connection back to cmcld over localhost. It is this connection that is failing.
The exact cause of the failure is going to be hard to determine on 11.15 since this does not allow logging to be turned up on these daemons. Note this release is unsupported and has been since last year. If you were running 11.17 or 11.18 you could have simply turned up logging on cmlvmd to see exactly what was going wrong. You should probably upgrade to one of these releases anyway since these are the only ones currently being actively patched.
Sorry, not sure I can suggest what you can check other than making sure the usual things such as /etc/services, permissions, networking (especially localhost communications) are correct. Can you telnet to localhost?
If this were me I would replace cmlvmd with a wrapper script which ran a tusc of cmlvmd to see if there are any particular system calls which are failing.
If you cannot solve this with this info I suggest you contact HP support.
10-04-2007 05:17 PM
Re: HELP!! Weird Serviceguard behaviours...
I checked network connectivity.
(# telnet localhost port)
hacl-hb 5300/tcp # High Availability (HA) Cluster heartbeat
hacl-gs 5301/tcp # HA Cluster General Services
hacl-cfg 5302/tcp # HA Cluster TCP configuration
hacl-cfg 5302/udp # HA Cluster UDP configuration
hacl-probe 5303/tcp # HA Cluster TCP probe
hacl-probe 5303/udp # HA Cluster UDP probe
hacl-local 5304/tcp # HA Cluster Commands
hacl-test 5305/tcp # HA Cluster Test
hacl-dlm 5408/tcp # HA Cluster distributed lock manager
Of these ports, only 5302 and 5303 accept connections; the others cannot be reached (connection refused).
Is this a normal condition?
I checked the /etc/services and /etc/inetd.conf files but found no errors and no special characters.
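The same check can be scripted instead of running telnet once per port. A minimal sketch, assuming Python is available on the node; `port_open` and `probe` are hypothetical helper names, and the port numbers are the TCP entries from the /etc/services listing above.

```python
import socket

# TCP entries taken from the /etc/services listing in this post.
HACL_TCP_PORTS = {
    "hacl-hb": 5300,
    "hacl-gs": 5301,
    "hacl-cfg": 5302,
    "hacl-probe": 5303,
    "hacl-local": 5304,
    "hacl-test": 5305,
    "hacl-dlm": 5408,
}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        # connect_ex returns 0 on success and an errno value otherwise.
        return s.connect_ex((host, port)) == 0
    finally:
        s.close()


def probe(host="127.0.0.1", ports=HACL_TCP_PORTS):
    """Return {service_name: True/False} for each listed TCP port."""
    return {name: port_open(host, port) for name, port in sorted(ports.items())}


# Example:
#   for name, ok in sorted(probe().items()):
#       print(name, "open" if ok else "refused or unreachable")
```

The UDP entries (hacl-cfg and hacl-probe on 5302/udp and 5303/udp) are not covered, because a TCP connect says nothing about a UDP listener.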
11-01-2007 02:51 PM
Re: HELP!! Weird Serviceguard behaviours...
We decided to reinstall the OS and SG.
The OS is acting somewhat abnormally, and we can't figure out what the problem is...
Thanks for your help..
Thanks again.