Serviceguard

Regarding trigger for node reset after uptime around 248.5 days

SOLVED
Minoru Asano
Frequent Advisor

Hello,

We found the following description in SGLX_00005.text from the patch kit:

Serviceguard causes a node to reset for
no apparent reason after a system uptime
of around 248.5 days with a 2.4 Linux
kernel or 24.8 days with a 2.6 kernel.
The symptom is that the deadman driver
expires without apparent cause.
Please note that this was also fixed in
SG A.11.16.02 (the RedHat4 support release).

[QUESTION]
- Could you tell me the root cause? What is the trigger?
Is it the uptime of the system, or the running time of the cluster daemon?
Is it reset by executing cmhaltcl?

In fact, one node has been running for 260 days.

Thank you for your advice.
Best Regards.
/Minoru.Asano
4 REPLIES
John Bigg
Esteemed Contributor
Solution

Re: Regarding trigger for node reset after uptime around 248.5 days

I think you already have my answer :-)

As written in the text you quoted from the patch text shown above, "after a system uptime..."

And the reason one node managed to get to 260 days is that the first node reset at 248.5 days, so the other node was only a single-node cluster when it reached the same 248.5 days of uptime shortly afterwards (they were both booted at about the same time). The node reset only occurs when the deadman timer is enabled, and it is only enabled when there is more than one node in the cluster.
Uwe Zessin
Honored Contributor

Re: Regarding trigger for node reset after uptime around 248.5 days

Cool. Looks like just _another_ case where a 32-bit signed counter with a 10 ms (248.5 days) or 1 ms (24.8 days) resolution wraps around:

(2**31)/100/60/60/24 = 248.55
John Bigg
Esteemed Contributor

Re: Regarding trigger for node reset after uptime around 248.5 days

The defect also affects 64-bit kernels. There the reset will occur after about 2925 million years for a 2.4 kernel and 292 million years for a 2.6 kernel.
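As a rough check of those 64-bit figures, here is a small sketch. It assumes a signed 64-bit tick counter starting at zero, the same 100 and 1000 ticks per second as in the 32-bit case, and a 365-day year; the names are illustrative only.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # assumption: plain 365-day year


def years_until_signed_wrap(bits: int, ticks_per_second: int) -> float:
    """Years until a signed counter of `bits` bits, starting at zero, overflows."""
    return 2 ** (bits - 1) / ticks_per_second / SECONDS_PER_YEAR


years_2_4 = years_until_signed_wrap(64, 100)   # assumed 10 ms tick (2.4 kernel)
years_2_6 = years_until_signed_wrap(64, 1000)  # assumed 1 ms tick (2.6 kernel)
print(f"2.4 kernel: {years_2_4 / 1e6:.0f} million years")
print(f"2.6 kernel: {years_2_6 / 1e6:.0f} million years")
```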

It's actually further complicated by the particular version of Linux: the default 2.6 kernels initialise jiffies to 4294667296 rather than zero (even for 64-bit kernels!), although SUSE have modified that back to zero.

Therefore, with a RedHat 32-bit 2.6 kernel the reset will occur after 5 minutes.
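The 5 minute figure follows from how close that initial value sits to the 32-bit limit. A minimal sketch, assuming the counter is treated as an unsigned 32-bit value that wraps to zero at 2**32 and that the 2.6 kernel ticks 1000 times per second:

```python
INITIAL_JIFFIES = 4294667296  # initial value quoted above
COUNTER_MODULUS = 2 ** 32     # an unsigned 32-bit counter wraps to zero here
TICKS_PER_SECOND = 1000       # assumed 1 ms tick for a 2.6 kernel

ticks_until_wrap = COUNTER_MODULUS - INITIAL_JIFFIES
minutes_until_wrap = ticks_until_wrap / TICKS_PER_SECOND / 60
print(ticks_until_wrap, "ticks =", minutes_until_wrap, "minutes")
```

The counter starts 300000 ticks short of the wrap, which is 300 seconds at this tick rate.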
Minoru Asano
Frequent Advisor

Re: Regarding trigger for node reset after uptime around 248.5 days

Hello,

Thank you for your reply and suggestions.
I now have enough information.

The one system started as a single-node cluster because "AUTOSTART_CMCLD" is "0".
So the deadman timer was not active.

I can now explain this behavior to the customer.

Thank you.
Best Regards.

/Minoru.Asano