Re: Regarding trigger for node reset after uptime around 248.5 days

Minoru Asano · ‎11-14-2006

Hello,

We found the following description in SGLX_00005.text in the patch kit:

Serviceguard causes a node to reset for
no apparent reason after a system uptime
of around 248.5 days with a 2.4 Linux
kernel or 24.8 days with a 2.6 kernel.
The symptom is that the deadman driver
expires without apparent cause.
Please note that this was also fixed in
SG A.11.16.02 (the RedHat4 support release).

[QUESTION]
- Could you tell me the root cause? What is the trigger?
Is it the uptime of the system, or the running time of the cluster daemon?
Is it reset by executing cmhaltcl?

In fact, one node has been running for 260 days.

Thank you for your advice.
Best Regards.
/Minoru.Asano

John Bigg · ‎11-15-2006

I think you already have my answer :-)

As the patch text you quoted above says: "after a system uptime..."

The reason one node managed to reach 260 days is that the first node reset at 248.5 days. Since both nodes were booted at about the same time, the other node reached 248.5 days of uptime shortly afterwards, but by then it was running as a single-node cluster. The node reset only occurs when the deadman timer is enabled, and it is only enabled when there is more than one node in the cluster.

Uwe Zessin · ‎11-15-2006

Cool. Looks like just _another_ case where a 32-bit signed counter with a 10 ms (248.5 days) or 1 ms (24.8 days) resolution wraps around:

(2**31)/100/60/60/24 = 248.55
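The same arithmetic covers both figures quoted from the patch text. A minimal Python sketch, assuming a signed 32-bit tick counter and tick rates of 100 Hz (2.4 kernel) and 1000 Hz (2.6 kernel); the function name is made up for illustration:

```python
SECONDS_PER_DAY = 60 * 60 * 24

def days_until_signed_wrap(ticks_per_second, bits=32):
    """Days until a signed tick counter of the given width overflows."""
    max_ticks = 2 ** (bits - 1)  # a signed counter overflows at 2**(bits-1)
    return max_ticks / ticks_per_second / SECONDS_PER_DAY

print(round(days_until_signed_wrap(100), 2))   # 10 ms ticks -> 248.55
print(round(days_until_signed_wrap(1000), 2))  # 1 ms ticks  -> 24.86
```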


John Bigg · ‎11-15-2006

The defect also affects 64-bit kernels. Here we will reset after 2925 million years for a 2.4 kernel and 292 million years for a 2.6 kernel.
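Those figures can be checked the same way. A rough Python sketch, assuming a signed 64-bit tick counter, 365-day years, and tick rates of 100 Hz (2.4) and 1000 Hz (2.6); the function name is hypothetical:

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # assumes 365-day years

def million_years_until_wrap_64(ticks_per_second):
    """Millions of years until a signed 64-bit tick counter overflows."""
    seconds = 2 ** 63 / ticks_per_second
    return seconds / SECONDS_PER_YEAR / 1e6

print(round(million_years_until_wrap_64(100)))   # 2.4 kernel -> 2925
print(round(million_years_until_wrap_64(1000)))  # 2.6 kernel -> 292
```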

It's actually further complicated by the particular version of Linux, since the default 2.6 kernels initialise jiffies to 4294667296 rather than zero (even for 64-bit kernels!), although SUSE have modified that back to zero.

Therefore with a RedHat 32-bit 2.6 kernel the reset will occur after 5 minutes.
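The 5-minute figure follows from that initial value. A small Python sketch, assuming HZ = 1000 and the kernel's convention of starting jiffies 300 seconds before the unsigned 32-bit wrap point. It only shows when the low 32 bits wrap, not how the deadman driver reacts to it:

```python
HZ = 1000  # assumed tick rate of a default 2.6 kernel

# Start value 300 seconds' worth of ticks below 2**32,
# i.e. the C expression (unsigned int)(-300 * HZ).
initial_jiffies = (-300 * HZ) % 2**32

ticks_to_wrap = 2**32 - initial_jiffies
minutes_to_wrap = ticks_to_wrap / HZ / 60

print(initial_jiffies)   # 4294667296
print(minutes_to_wrap)   # 5.0
```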

Minoru Asano · ‎11-16-2006

Hello,

Thank you for your reply and suggestions.
I now have enough information.

One system started as a single-node cluster because "AUTOSTART_CMCLD" is set to "0",
so the deadman timer was not enabled.

I can now explain this behaviour to the customer.

Thank you.
Best Regards.

/Minoru.Asano
