Serviceguard stops package - reboots everything

blake_15 · ‎05-18-2006

Hello,

I have a two node cluster running two packages, an Oracle package for the DB server and an application package. Node 1 is running application and node 2 runs Oracle.

A few days ago, SG seems to have attempted to stop the application package for what appears to be no reason.

Things got a bit strange at that point as SG had some issues unmounting filesystems. I can see it trying to run fuser, which is normal, but instead of unmounting, I see the following in package control log:

umount: cannot lock /etc/mnttab; still trying ...

SG then attempts to run the package on the other node (running the DB server package), which worked according to the package log. At this point, the node 1 cluster daemon exits and the system reboots. Node 2, which has just loaded the app package, reboots about 2 minutes later. This reboot, however, is not recorded in any logs that I can see. It really looks as though someone just pulled the power, as nothing was shut down properly.

Everything started up ok.

Our app has pretty extensive logging and from what I see it was functioning fine at the time. No issues with networking either, and no hardware or power problems.

I am running HPUX 11.0, SG is 11.14. Patching and updates are at Spring 2005 release and are identical on the two servers. HP has asked that I install a couple SG patches (PHSS_32260 and PHSS_32261).

HP tech support can't provide any explanations that work, especially with Node 2. They are of the opinion that someone manually stopped the app package, and later somehow pulled power on node 2. This is pretty much impossible, nobody was logged in at the time. HP suggested that I up the logging on SG, and install a couple SG patches (PHSS_32260 and PHSS_32261).

Anyone have similar experience where SG stops a package for no reason?

thanks,
blake

Steven E. Protter · ‎05-18-2006

Shalom,

I've heard similar stories in the past. What happened in most of them was some SG criteria like heartbeat was lost long enough to trigger a TOC of one of the nodes.

It went like this. SG was set up with heartbeat on the same LAN as all the other servers. Gradually, as more servers were added, congestion increased. One day the setup became unstable.

There should be evidence on the logs to prove what happened.

If the package was manually stopped, there should be a login and keyboard log to prove or disprove it.

I'd run fsck on those filesystems soon, there may be a problem. Also a disk issue could trigger this. I'd check shared and non-shared disks.

SEP

Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com

melvyn burnard · ‎05-18-2006

SG will not normally stop a package for "no reason".
I would investigate the package log for that package on node 1, as well as tie up the timestamps in the OLDsyslog.log file to see if there is any correlation between events in the syslog/package log.
I would strongly suspect that SG halted the package due to the loss of a service or something like that.

Node 1 rebooting would usually record something in the shutdownlog, even if it were a TOC. If something else were to have caused an issue, for example where cmlvmd failed, then SG can issue a reboot -q command, which of course means that the reboot is NOT logged anywhere.

Also review the node2 syslog at around the time of the "failure" to see if anything was spotted by node2.
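To tie up the timestamps, one option is to pull every syslog entry that falls within a few minutes of the package halt. The following is only a minimal sketch: the function names, the window size and the year argument are my own assumptions (syslog lines carry no year), and it has not been run against the actual logs from this cluster.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line.

    Returns None when the line does not start with such a stamp.
    The year is supplied by the caller because syslog omits it.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def entries_near(lines, center, minutes, year):
    """Return the lines whose timestamp is within +/- minutes of center."""
    delta = timedelta(minutes=minutes)
    hits = []
    for line in lines:
        stamp = parse_syslog_time(line, year)
        if stamp is not None and abs(stamp - center) <= delta:
            hits.append(line.rstrip("\n"))
    return hits
```

For example, calling entries_near() on the lines of OLDsyslog.log with the time taken from the package control log and a window of 5 minutes would list only the entries around the halt.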

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!

blake_15 · ‎05-19-2006

Thanks for the suggestions. I've ruled out loss of heartbeat as a cause. Both nodes connect to EVA3000 SAN. I don't see any problems there, or on local disks.

The HP engineer that was helping me pointed out the following line in node 1 syslog, which corresponds with the beginning of our problems:

May 14 06:56:10 hp3 CM-CMD[23195]: /usr/sbin/cmhaltnode -vf

This is the app package being stopped by 'CM-CMD', which means that the package was stopped from command line. If SG stops the package, it should be logged under 'CMCLD', the cluster daemon. Is this true?
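One way to look at this in the attached syslog is to separate the entries by their tag, so that command-line activity (CM-CMD) and cluster daemon activity can be read side by side. This is a rough sketch only: the regular expression assumes the usual "date host tag[pid]: message" layout seen in the line above, and matching the daemon tag as "cmcld" regardless of case is my assumption.

```python
import re

# Assumed layout: "May 14 06:56:10 hp3 CM-CMD[23195]: message"
ENTRY_RE = re.compile(
    r"^\w{3}\s+\d+\s[\d:]{8}\s(\S+)\s([^\[:\s]+)(?:\[(\d+)\])?:\s?(.*)$"
)


def parse_entry(line):
    """Split one syslog line into host, tag, pid and message (None if no match)."""
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    host, tag, pid, message = match.groups()
    return {"host": host, "tag": tag, "pid": pid, "message": message}


def group_by_tag(lines, tags=("CM-CMD", "cmcld")):
    """Collect parsed entries under each requested tag, ignoring case."""
    wanted = {tag.lower(): tag for tag in tags}
    groups = {tag: [] for tag in tags}
    for line in lines:
        entry = parse_entry(line.rstrip("\n"))
        if entry is not None and entry["tag"].lower() in wanted:
            groups[wanted[entry["tag"].lower()]].append(entry)
    return groups
```

The CM-CMD group then shows which commands were run and under which PID, which can be compared against the daemon's own messages for the same minute.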

I've attached syslog from node 1, and package logs. Unfortunately, I don't have node 2 syslog. First line support rebooted the box in a panic and didn't save OLDsyslog.log.

thanks,
blake

Matti_Kurkela · ‎05-19-2006

May 14 06:56:10 hp3 CM-CMD[23195]: /usr/sbin/cmhaltnode -vf

Looks like somebody was not just halting the package, but halting the entire node. That would cause the cluster to transition to single-node state. It should not have caused the node to reboot, just to leave the cluster.

Remember that you can run the Java-based ServiceGuard Manager software on another host. If you know the root password to the cluster's member(s), you can connect to the cluster and use the GUI to issue run/halt commands to packages and nodes. I haven't checked, but I believe the use of ServiceGuard Manager does not show up in the logs as a normal login.

The message
umount: cannot lock /etc/mnttab; still trying ...
suggests that something was holding /etc/mnttab open (or holding the lockfile used by mount/umount when changing /etc/mnttab). If this delayed the umount so that ServiceGuard ended up using more drastic measures, this might have contributed to the problem.

I agree with Melvyn, it would be interesting to see the entries in /var/adm/shutdownlog at the time of this problem. If there are messages like "Reboot after panic: SafetyTimer expired, isr.ior = ...", the reboot was caused by ServiceGuard trying to avoid a possible split-brain situation in the cluster.
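A quick way to check the shutdownlog for such entries is sketched below. It simply searches for the "SafetyTimer expired" text quoted above; the function names are hypothetical and the exact wording of the message on your release is an assumption.

```python
def find_safety_timer(lines):
    """Return the lines that mention an expired SafetyTimer (case-insensitive)."""
    return [
        line.rstrip("\n")
        for line in lines
        if "safetytimer expired" in line.lower()
    ]


def scan_shutdownlog(path="/var/adm/shutdownlog"):
    """Read the shutdownlog at path and return any SafetyTimer entries."""
    with open(path, errors="replace") as handle:
        return find_safety_timer(handle)
```

An empty result from scan_shutdownlog() on both nodes would mean ServiceGuard did not record a SafetyTimer reboot there, so the cause would have to be looked for elsewhere.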

MK

blake_15 · ‎05-19-2006

Shutdownlog (attached) recorded no entries on either system. There was no crash dump recorded either in /var/adm/crash. If there was a panic, I can't find any evidence of it.

We don't run ServiceGuard GUI so shutdown couldn't have happened that way.

thanks for your help.

Albert_31 · ‎05-22-2006

Hello Blake,

Has anyone checked the GSP/MP logs to find the reason why the system went down?

If it was a normal or panic reboot, it will be clearly logged in the GSP/MP logs. If there was a sudden loss of power, there will be no messages whatsoever in the MP/GSP logs; we will only get the customary startup messages.

You can send me the logs if you want me to check it.

mp> sl -> error logs -> select the dump option

mp> sl -> forward progress logs -> select the dump option.

Regarding who gave the halt command, I will get back to you soon.

What is node_failfast set to? Collect the cluster binary info using cmgetconf and we can check that as well.

regards

Albert..

blake_15 · ‎05-23-2006

Hi Albert,

Thanks for taking the time to look into this.

I have attached a zip file of logs. Basically, it has everything that I had gathered, including shutdownlog, package control logs, syslog (from one node), some GSP error messages, and cluster config output. Note that the Oracle server (HP4) doesn't have a syslog from the time of the error. It was rebooted manually, and the OLDsyslog.log was lost unfortunately.

I'm not sure how to dump GSP messages, so I've only included the one error message that came up on each server.

Node failfast is set to 'no' for both packages. If I can provide any more useful info, please let me know.

Albert_31 · ‎05-23-2006

Hello Blake,

Just to update you: I have checked and found that under no circumstances will SG or the SG daemons issue "/usr/sbin/cmhaltnode -vf" as it is.

I will go through the logs and update you soon.
regards

albert
How to collect the GSP messages

a) The details depend on which tool you use for telnet, e.g. putty, reflection, etc.

b) Enable logging in the application. For putty, go to Session (you get it when you open putty the first time), select Logging, select "only printable characters", and give a file name.

c) telnet to GSP
MP> sl > e (for error log) > d (for dump entire log)
mp> sl > f (for forward logs) > d (for dump entire logs)
mp> sl > f (for forward logs) > d (for dump entire logs) --> second time since half the data will be in buffer.

d) Disconnect from the MP and close the putty session. Send me the file.

Note that the procedure is the same for any application, except if you use cmd on Windows, where it will be very hard to collect the data.

blake_15 · ‎05-24-2006

Hi Albert,

In the GSP, at the sl (show logs) prompt, I don't have an option to dump, only to view or set up a filter. Also, I don't have an option for 'forward logs'; I do have incoming/activity/error/current boot/last boot.
I did capture all error and activity GSP logs from each server. Logs are attached. Hope they are of some use. If you need any other GSP logs, just let me know.

Very confused about how 'cmhaltnode' command was run. I have HP support escalating the issue, but they have yet to give me any feedback. I'll let you know if they come back with anything.

thanks again for helping me out!
blake

Albert_31 · ‎05-25-2006

Hello Blake,

Thank you for the logs. I have analyzed them; please find my observations and conclusions below.

Observation :-

a) The GSP does not log any messages on either server for a long time.
b) On 14/5/2006 at about 11:00 AM GMT, both servers (hp3 and hp4) log the message below.

DATE: 05/14/2006 TIME: 10:59:05
ALERT LEVEL: 1 = Information only, no action required

SOURCE: 6 = platform
SOURCE DETAIL: 6 = service processor SOURCE ID: 0
PROBLEM DETAIL: 1 = selftest result

This is a GSP selftest message, logged whenever there is a GSP reset or power is restored to the system.

c) After that we see normal messages about the system getting initialized, etc.
d) At 0900hrs GMT on 16/05/2006 we again see an error about power being lost or a power supply issue (event 6 on hp3 and event 11 on hp4), which is restored within a couple of minutes. This was last seen on 16/05/2006; there are no such log entries after that.

Conclusion
a) Hence we can confidently confirm that at some time before 1100hrs GMT, both servers lost power temporarily and regained it at 1100hrs GMT.
b) There was some kind of issue with the power. I am not sure what it is or whether it is still present. You can check the /var/opt/resmon/log/event.log file regularly for any such power loss, or ask HP for an explanation.
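Since the GSP stamps above are in GMT while syslog uses the server's local time, it can help to convert one to the other before comparing them. The sketch below is illustrative only: it extracts the "DATE: ... TIME: ..." stamps in the format shown above and shifts them by a UTC offset that you must supply yourself, because the servers' time zone is not stated anywhere in this thread.

```python
import re
from datetime import datetime, timedelta

# Matches headers such as "DATE: 05/14/2006 TIME: 10:59:05"
STAMP_RE = re.compile(
    r"DATE:\s*(\d{2}/\d{2}/\d{4})\s+TIME:\s*(\d{2}:\d{2}:\d{2})"
)


def gsp_timestamps(text):
    """Return every GSP entry timestamp found in text, as GMT datetimes."""
    return [
        datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
        for date, time in STAMP_RE.findall(text)
    ]


def to_local(stamp, utc_offset_hours):
    """Shift a GMT timestamp by the given offset (an assumption you provide)."""
    return stamp + timedelta(hours=utc_offset_hours)
```

With the correct offset for the site, the converted GSP selftest time can be placed next to the syslog entry for the cmhaltnode command to see how far apart the two events really were.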

Let me know if you have any clarifications on the above.

warm regards

Albert

blake_15 · ‎05-25-2006

Thanks very much for your work Albert. I've had the electronics module replaced in the UPS since.

Albert_31 · ‎05-25-2006

hello blake

Looks like you caught the culprit :)

albert
