SGLX A.12.80 TimeOut for cmrunpkg when second node failed.

yilmazaydin — Wed, 30 Nov 2022 05:45:26 GMT

Hello.

We are seeing a very long timeout when starting a package. What happens:
1. The package failed on the first node and was not restarted on the second node. The first node then went down and became unreachable.

2. We tried to start the package on the second node, but the cmrunpkg command hung for too long, with no messages in the package log file or the system journal. We could not wait more than 5 minutes, so we restarted the second node.

3. After restarting the second node, we were able to start a single-node cluster with cmruncl -f -n node2. The package also started automatically.

Question: why does cmrunpkg hang when one node of a two-node cluster is unreachable?
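
As a stopgap while the cause is unknown, the wait can at least be bounded so that the command does not block an operator indefinitely. The following is a minimal sketch, assuming GNU coreutils `timeout` is installed; `run_with_deadline` is a hypothetical helper name, not part of Serviceguard. It does not fix the hang, and whether cmrunpkg cleans up properly when it receives SIGTERM is an assumption that would need to be checked.

```shell
# Hypothetical helper: run a command, but give up after a deadline.
# Usage: run_with_deadline SECONDS COMMAND [ARGS...]
run_with_deadline() {
    limit=$1
    shift
    # GNU timeout sends SIGTERM when the limit expires and exits with 124.
    timeout "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "gave up after ${limit}s: $*" >&2
    fi
    return "$rc"
}

# Example call (assumption: 300 s mirrors the 5-minute limit mentioned above):
# run_with_deadline 300 cmrunpkg -n node2 testpkg
```

An exit status of 124 then tells a calling script that the start attempt timed out rather than failed outright, so it can log the event or escalate.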

Additional info:

We tested and confirmed how the remaining cluster node behaves while the second node is unavailable.

Test case
The cluster has two nodes, node1 and node2, and one package, testpkg.
Stop testpkg.
Stop node2 with poweroff.
Check deadman (lsof |grep deadman).
Start testpkg on the node that is still running.

Short results (timeline)

Normal time to move the package
date && cmhaltpkg testpkg && cmrunpkg -n node1 testpkg && date ##Fri Nov 11 12:49:09 UTC 2022 - Fri Nov 11 12:50:22 UTC 2022
date && cmhaltpkg testpkg && cmrunpkg -n node2 testpkg && date ##Fri Nov 11 12:51:15 UTC 2022 - Fri Nov 11 12:52:12 UTC 2022

Abnormal time (node2 powered off)
date && cmviewcl ##Fri Nov 11 12:56:31 UTC 2022
date && cmhaltpkg testpkg ##Fri Nov 11 12:56:40 UTC 2022
date && poweroff ##Fri Nov 11 12:57:07 UTC 2022
date && cmviewcl ##Fri Nov 11 12:57:20 UTC 2022
date && lsof |grep deadman ##Fri Nov 11 12:57:29 UTC 2022
date && tail -500 /var/log/messages | grep cmcld ##Fri Nov 11 12:57:37 UTC 2022
date && time cmrunpkg testpkg ##Fri Nov 11 12:58:27 UTC 2022 - waited 18 minutes, then aborted it
date && cmhaltnode -f && date ##Fri Nov 11 13:17:41 UTC 2022
date && cmruncl -f -n node1 && date ##Fri Nov 11 13:18:59 UTC 2022
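
To make the difference easier to see, the timestamps above can be turned into durations. This is a small sketch, assuming GNU `date` (for the `-d` option); `elapsed_seconds` is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: seconds between two timestamps in `date` output format.
# Relies on GNU date's -d option to parse the strings.
elapsed_seconds() {
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    echo $((end - start))
}

# Normal move to node1: prints 73
elapsed_seconds "Fri Nov 11 12:49:09 UTC 2022" "Fri Nov 11 12:50:22 UTC 2022"
# Normal move to node2: prints 57
elapsed_seconds "Fri Nov 11 12:51:15 UTC 2022" "Fri Nov 11 12:52:12 UTC 2022"
# From starting cmrunpkg with node2 down to starting cmhaltnode: prints 1154
elapsed_seconds "Fri Nov 11 12:58:27 UTC 2022" "Fri Nov 11 13:17:41 UTC 2022"
```

So a normal move takes about one minute, while the last interval is roughly 19 minutes, which covers the 18-minute wait on cmrunpkg before it was aborted.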

Has anyone encountered a similar situation?

Re: SGLX A.12.80 TimeOut for cmrunpkg when second node failed.

Sush_S — Wed, 16 Nov 2022 05:45:49 GMT

Hi,

You are hitting a known problem, which should be fixed in the next patch release (I am not sure of the ETA). Please reach out to the support team for a workaround.

Thank you!
