Serviceguard

Problem with Linux ‘ps’ command can cause false failover of packages

Serviceguard for Linux
Honored Contributor

The Linux command ‘ps pid’ will sometimes return empty output, even if the process ‘pid’ exists. This problem occurs with different frequencies in various releases of Linux. It has been seen on RedHat 2.1 and RedHat 3. It is believed to exist in SLES8 and may exist in SLES9 and RedHat 4. The details are in https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=158277. The problem is related to how the ‘ps’ command checks the list of pids in the /proc filesystem. It is most likely to occur in a very dynamic environment where a large number of short-lived processes are being created.

Various Serviceguard for Linux toolkits use this command, and it is a suggested method for users writing their own scripts. The scripts check the pid to see whether the process being monitored is still running. The ‘ps’ error causes the monitor script to falsely determine that the process is no longer running, which causes the package to fail over.

The exact command lines that have problems are:

pid=`ps $p_pid | grep $PROC | awk '{print $1}'`
if [ -z "$pid" ]; then

This should be replaced with:

grep $PROC /proc/$p_pid/stat >/dev/null
if [ $? -ne 0 ]; then

Rather than looking through all of the pids in the /proc filesystem, this just checks the pid that is being monitored.
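As a minimal sketch, the replacement check could be wrapped in a small function. The name proc_is_running is hypothetical and not part of any toolkit, and the quoting, the readability test, and the -F (fixed string) flag are additions to the two lines above. Note that /proc/<pid>/stat holds the command name in parentheses, truncated to 15 characters, so the name passed in should fit within that.

```shell
#!/bin/sh
# Hypothetical helper, not taken from any Serviceguard toolkit.
# Usage: proc_is_running <pid> <name>
# Returns 0 if /proc/<pid>/stat exists and contains <name>, non-zero otherwise.
proc_is_running() {
    # If the stat file is missing, the process is gone; skip grep so that
    # no "No such file or directory" message is produced.
    [ -r "/proc/$1/stat" ] || return 1
    # -F treats <name> as a fixed string rather than a regular expression.
    grep -F -- "$2" "/proc/$1/stat" >/dev/null
}
```

A monitor script might then call it as: if ! proc_is_running "$p_pid" "$PROC"; then ... fi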

If you think you have experienced a false failover, then check the monitor scripts and make this change.

Even if you have not experienced a false failover, it is recommended that you make this change. Any contributed toolkit that uses the ‘ps’ command in this way will be changed in its next release. Because of testing, this may take up to 3 months for any specific toolkit.

Remember to make the change on all servers that may run the package. Note that because the file is open on the server currently running the package, it will not be updated immediately on that node. That node will only be updated after the package is moved. During a maintenance period, move the package and recheck the file on all nodes. Remember, if a server fails between a change to the file and the maintenance period, the file may not have been updated. That is why it is CRITICAL to recheck all nodes after the package move.

As new or updated toolkits are released,
4 REPLIES
Stuart Browne
Honored Contributor

Re: Problem with Linux ‘ps’ command can cause false failover of packages

If you're just disposing of the output of 'grep', why not just 'grep -q $PROC /proc/$p_pid/stat' ? Either way, you're going to get ugly messages if the pid doesn't currently exist (No such file or directory).

The furthering of this would be to put the grep straight into the if:

if grep -q $PROC /proc/$p_pid/stat 2>/dev/null

as 'if' checks the exit state of the application.. just a bit quicker than launching 'test' ([) and checking $?.
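A rough sketch of that form with the sense inverted, so it matches the original "failure" test. The function name check_once and the "running"/"stopped" strings are made up for illustration; only the grep line comes from the suggestion above.

```shell
#!/bin/sh
# Hypothetical wrapper for illustration only.
# Usage: check_once <pid> <name>; prints "running" or "stopped".
check_once() {
    # 'if' tests grep's exit status directly; '!' inverts it so the first
    # branch is the failure case, like the original '[ $? -ne 0 ]' test.
    # 2>/dev/null hides the "No such file or directory" message.
    if ! grep -q -- "$2" "/proc/$1/stat" 2>/dev/null; then
        echo "stopped"
    else
        echo "running"
    fi
}
```

Note that discarding stderr also hides any other grep error, which is a trade-off if those errors are wanted in the log.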

Anyway, just some thoughts.
One long-haired git at your service...
Huc_1
Honored Contributor

Re: Problem with Linux ‘ps’ command can cause false failover of packages

Oh la! la! ... thanks for letting us in on this one. I don't use Serviceguard for Linux, but as you say this may affect any "ps pid".

Will find/search all my scripts for the use of this.

I did read the bugzilla entry to try to understand it all, but it seems to me that Stuart Browne's thoughts are the correct way to go, or is there something we missed?

Jean-Pierre Huc
Smile I will feel the difference
Serviceguard for Linux
Honored Contributor

Re: Problem with Linux ‘ps’ command can cause false failover of packages

Stuart,

We will keep it this way because we'll get any errors from "grep" logged. Also, test is a built-in function, so there is no major launch overhead.

There may be some advantage to the -q.

We really want to change as little as possible to minimize the risk of introducing another problem.

Huc,

That's why I posted it with the description - to make everyone who uses this aware of possible problems. Glad it may help.

Stuart Browne
Honored Contributor

Re: Problem with Linux ‘ps’ command can cause false failover of packages

Hrm.. I guess it's only when shown these situations that new things are learnt. I always thought that in bourne-based shells, the single []'s were based off the external test:

lrwxrwxrwx 1 root root 4 May 1 2004 /usr/bin/[ -> test

and that '[[]]' are inbuilt. Sometimes shell man pages are just too long:

man bash: under 'CONDITIONAL EXPRESSIONS'
Conditional expressions are used by the [[ compound command and the test and [ builtin commands to test file attributes and perform string and arithmetic comparisons.
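For anyone wanting to confirm this on their own system, a quick check (a sketch; the exact wording of the output may vary between shells) is to ask the shell how it resolves '[' and 'test':

```shell
#!/bin/sh
# 'type' reports how the shell would resolve a command name.
# For a builtin, common shells report it as a "shell builtin" rather than
# printing a path such as /usr/bin/[.
type [
type test
```

The external /usr/bin/[ still exists for programs that are not shells, but the shell uses its own builtin first.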

My apologies.

In any case, as the grep isn't disposing of STDERR, you'll still get an ugly-error when the '$p_pid' doesn't exist.
One long-haired git at your service...