Re: How to kill service monitor scripts when package has failed

Joe Geiger · 06-25-2012

We have had this scenario come up a few times within the last year: Basically, we will have a package that has failed (as opposed to halted) but for which all services are still running. We have found that we simply cannot halt the services for the package because the service monitoring script(s) continue to run (even if we try to kill them!). That is, every time we kill a service, the service monitor script -- which is still running despite the package being "failed" -- restarts the service. Unfortunately, when this occurs, we have found (so far) that the only way out is a reboot. If other packages are OK, too bad -- we have to take them down (or fail them over) for a reboot. Does anyone have any similar experience? More importantly, does anyone know how to halt the services for a "failed" package?

We are running SG 11.19 on 11i V3 in a two-node cluster (rx6600's). The root cause (we think) that got us into this latest mess was that the network folks were doing some work over the weekend and all of the LAN connections went down. We had two packages configured to monitor just the one subnet -- these packages failed. Another package (in which we were not monitoring the subnet) remained up.

Ken Grabowski · 06-29-2012

You're saying that as root you issue a kill -9 on the monitor process and it will not be killed? The only reason I know for that would be an IO hang! Are you sure the service monitor is not being restarted by something? Does it have the same PID after you kill it?

I'm not sure why your monitor service would be restarting services. If the package has failed, it should be running the service stop functions, not the start functions. It sounds like you need to revisit your package design and make sure it is properly put together and that the Service Control (monitor) script is calling all the correct functions at the correct time.

If you have a test or development cluster you should be able to test all scenarios including pulling network cables and verify the proper behavior.

Joe Geiger · 06-29-2012

A 'kill -9' did indeed kill the monitor script -- however, it was immediately restarted -- presumably by the cmcld daemon (at least that is my thought). Note that nothing in the package was halted -- not the services nor even the virtual IP address. This scenario happened once before -- wrote it off as a "glitch" that time -- but, when it happened again (forcing a reboot!), I thought I would take a stab in here to see if I was missing something obvious.

What we do know is that we were monitoring exactly one subnet (as defined in the package configuration file) and that the subnet being monitored "disappeared" (due to some network team updates -- this was unexpected). The package was configured to fail over to a second node but did not -- presumably because rather than halting normally, it halted with a FAILED state.

I'm thinking that a call to HP support is in order.

melvyn burnard · 07-23-2012

Do you have the service configured to restart infinitely?
Check the service_restart value:
SERVICE_RESTART[0]="-R"

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!

Joe Geiger · 07-23-2012

Yes -- we do have some of the services for the package set for infinite restart. As I understand it (and I have a feeling that "my understanding" may be about to be corrected!), with an infinite restart configured, the monitor script, upon sensing that the service has "died", will restart the service paying no mind to how many times it has been restarted and, as a consequence, w/o initiating any failover for the package. With that being said, I still (want to) think that when the monitor script itself is killed, *it* itself should not be restarted. But, as I noted, this is not the case -- ergo, my understanding is less than optimal :)

asghar_62 · 07-24-2012

Let's look at how Serviceguard monitors the packages and decides to halt or fail over a package. Serviceguard actually monitors your services, which are usually scripts monitoring the actual application (whatever that is, Oracle etc.). So if your package is configured to restart the service indefinitely, then even if your application is down, the following scenario will play out:

Service restarts --> Service recognizes the application is down --> Service terminates itself --> SG (package manager) restarts the service again with "-R" --> Service recognizes the application is down --> Service terminates itself --> SG (package manager) restarts the service again with "-R" and so on and on ...

This will go into an infinite loop.

So SG cannot (based on your decision) fail over the package because you told SG to restart the monitoring script indefinitely! I would suggest either not assigning any value to SERVICE_RESTART, which will then default to "none", or setting it to an acceptable restart number such as 2 or 3.
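
The loop described above can be sketched as a toy simulation. This is not Serviceguard code: `simulate_restarts`, `restart_limit`, and `max_cycles` are hypothetical names made up for illustration, and the model assumes the application never comes back, so every restarted monitor exits again right away.

```python
from typing import Optional


def simulate_restarts(restart_limit: Optional[int], max_cycles: int = 10) -> dict:
    """Toy model of the restart loop for a service whose application stays down.

    restart_limit=None models unlimited restarts ("-R");
    restart_limit=0 models a service with no restarts configured.
    max_cycles only bounds the simulation; it is not a Serviceguard setting.
    """
    restarts = 0
    while True:
        # At this point the monitor has just exited because the application is down.
        if restart_limit is not None and restarts >= restart_limit:
            # Restart budget used up: the service is finally declared failed.
            return {"restarts": restarts, "service_failed": True}
        if restarts >= max_cycles:
            # Simulation bound reached and the service was never declared failed.
            return {"restarts": restarts, "service_failed": False}
        restarts += 1  # the package manager starts the monitor again


print(simulate_restarts(None))  # unlimited: keeps restarting until the bound
print(simulate_restarts(2))     # limited: gives up after two restarts
```

With an unlimited policy the service is never reported as failed no matter how large `max_cycles` is, which is why the package never gets a reason to fail over.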

Joe Geiger · 07-24-2012

Actually, for some services in this package, we *don't* want a failure of that service to cause a failover -- that's precisely why we used the infinite restart option. Recall from the original problem description that the package was actually in a FAILED state -- I understand this to mean that SG was unable to shut down the package "cleanly" (i.e. the halt script returned a non-zero status). Although the package was down (albeit in a "failed" state versus "halted" state), the monitor script for the service in question was still running. Moreover, every time I killed the monitor script, it was restarted -- that's what I don't understand. I know it was restarted (versus just ignoring the SIGTERM signal) because it had a different PID every time I killed it and it was restarted. Because of this behavior, I was unable to unmount the filesystems associated with the package -- our only way to circumvent this was to reboot the system.

asghar_62 · 07-24-2012

For a situation where the package halt has failed and the package is in limbo (for instance, if you lose all connections to the shared VGs and the package cannot deactivate the VG and unmount the LUNs, hence hanging at that step), you can enable service_fail_fast_enabled.

I understand that you don't want a failure of the service to cause a package failover, BUT if a service fails more than, let's say, 3 times, then most probably something is wrong in the environment and you might want SG to fail over your package to the healthy node, so that you can investigate the issue on the original node. You might want to set the value to a number (I recommended 2 or 3, but set it to whatever you are comfortable with) so that after that many failures the service will not be restarted any more, and if service_fail_fast_enabled is set to true, the system will TOC and the package will fail over.
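
The decision described in this post can be written down as a small sketch. It only models what the post says, not Serviceguard's actual implementation: `action_on_service_exit` and the three result strings are hypothetical names, and what follows a plain "service_failed" result (package halt, failover) is assumed to depend on the rest of the package configuration.

```python
def action_on_service_exit(failures: int, restart_limit: int, fail_fast: bool) -> str:
    """Model the reaction after the service has exited `failures` times.

    restart_limit is the configured number of restarts (for example 2 or 3).
    fail_fast stands in for service_fail_fast_enabled.
    """
    if failures <= restart_limit:
        # Still inside the restart budget: just start the service again.
        return "restart_service"
    if fail_fast:
        # Budget exhausted and fail-fast is on: the node is reset (TOC)
        # and the package moves to the other node.
        return "toc_node_and_failover"
    # Budget exhausted without fail-fast: the service is declared failed;
    # the follow-up is left to the package's normal halt and failover handling.
    return "service_failed"


for n in range(1, 5):
    print(n, action_on_service_exit(n, restart_limit=3, fail_fast=True))
```

With a limit of 3, the first three exits lead to a restart and the fourth triggers the fail-fast path.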
