HB connection

 
gherbi
Occasional Advisor

HB connection

Hi,
We have a heartbeat (HB) switch between two nodes (11.23 with SG 11.17).
In syslog I see these messages:
Jan 21 21:45:52 gold1 cmcld[2114]: HB connection to 192.168.0.2 not responding, closing

Jan 21 21:50:37 gold1 cmcld[2114]: HB connection to 192.168.0.2 is responding

Please comment.
8 REPLIES
Kevin Wright
Honored Contributor

Re: HB connection

Looks like you had a connectivity issue. Did somebody disconnect a cable?
Fabio Ettore
Honored Contributor

Re: HB connection

Hi gherbi,

Once you have checked whether someone disconnected and reconnected cables on the HB switch, check whether the message appears more than once in syslog.log (and in OLDsyslog.log as well).
If it does not, it was most likely an isolated communication problem. Otherwise, check the health of the switch, the switch ports, and the cables.

Best regards,
Fabio
WISH? IMPROVEMENT!
Samir Pujara_1
Frequent Advisor

Re: HB connection

Hi,

Surely it is a network connectivity issue. The most common causes are a loose connection to the hub used for the HB connection, or a loose power connection on that hub.

Otherwise, someone might have removed the cable by mistake.

Samir
gherbi
Occasional Advisor

Re: HB connection

Thanks, but the message is repeated practically every hour in syslog, and on both nodes.
Fabio Ettore
Honored Contributor

Re: HB connection

gherbi, as everyone here has written, if the message keeps repeating then there is a communication problem on the connections through the switch. Check the health of the switch, the switch ports, and the cables.
If the message is always repeated for the same address (192.168.0.2), you should isolate the problem to that path.

Best regards,
Fabio
WISH? IMPROVEMENT!
John Bigg
Esteemed Contributor

Re: HB connection

Early versions of Serviceguard did not log messages when HB paths stopped responding, and would not re-establish communications until the last HB path failed. At that point a cluster reformation would take place and all HB paths would be closed and re-established.

However, from version 11.09 onwards (releases older than 11.15 require patches for this) a remote communications health monitor was added, which monitors data on the connections and attempts to re-establish them if they fail. See for example the defect description for item 6 in 11.14 patch PHSS_27246. The messages you are seeing are reported by the rcomm health monitor and indicate that, although the socket connection is still open, no heartbeat messages are being received. The connection is therefore closed and Serviceguard attempts to re-open it. You would expect to see messages of the form "HB connection to X.X.X.X is responding" if the connection is re-established successfully.
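
To see how often the connection drops and how long each drop lasts, the two message forms can be paired up from syslog. The following is only a rough sketch in Python, not a Serviceguard tool: the function name `hb_outages` and the `year` parameter are made up for illustration, the regular expression is based solely on the two lines quoted in this thread, and month names are assumed to be in English (C locale).

```python
import re
from datetime import datetime

# Pattern derived only from the two cmcld lines quoted in this thread.
HB_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+cmcld\[\d+\]:\s+"
    r"HB connection to (?P<ip>\d+\.\d+\.\d+\.\d+) "
    r"(?P<state>not responding, closing|is responding)"
)

def hb_outages(lines, year=2000):
    """Pair each 'not responding' message with the next 'is responding'
    message for the same IP address.

    syslog timestamps carry no year, so `year` is an assumption supplied
    by the caller. Returns a list of (ip, down, up, duration) tuples.
    """
    pending = {}   # ip -> time of the first unanswered 'not responding'
    outages = []
    for line in lines:
        m = HB_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        ip = m["ip"]
        if m["state"].startswith("not"):
            pending.setdefault(ip, ts)
        elif ip in pending:
            down = pending.pop(ip)
            outages.append((ip, down, ts, ts - down))
    return outages

sample = [
    "Jan 21 21:45:52 gold1 cmcld[2114]: HB connection to 192.168.0.2 not responding, closing",
    "Jan 21 21:50:37 gold1 cmcld[2114]: HB connection to 192.168.0.2 is responding",
]
for ip, down, up, duration in hb_outages(sample):
    print(ip, down.time(), up.time(), duration)
# prints: 192.168.0.2 21:45:52 21:50:37 0:04:45
```

Feeding it the lines of syslog.log and OLDsyslog.log on both nodes would show whether the drops line up in time, which helps when comparing against switch logs.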

Now we can say for sure that the problem is not a loose cable on the LAN card, or a failing switch directly connected to the LAN card, since in those cases Serviceguard would detect a failed LAN and mark it down. You would then see LAN failure messages, which you do not report.

Therefore the most likely cause is a problem elsewhere in the network which is preventing traffic between node gold1 and IP address 192.168.0.2 from getting through.

If these messages are regularly repeated, something is intermittently blocking the heartbeat messages, but it is not a hard failure.

You may wish to set up a ping from the node to the specific IP address and look for packet loss. When the problem occurs you will then need to track down where in your network the failure is occurring.
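
One way to run such a ping over a long period and record exactly when packets are lost is a small wrapper script. This is a minimal sketch in Python under some assumptions: `probe` and `monitor` are illustrative names, and the ping command line is passed in by the caller because ping options differ between operating systems (the commented example uses Linux/BSD-style flags, which would need adjusting on HP-UX).

```python
import subprocess
import time
from datetime import datetime

def probe(cmd):
    """Run one ping command; return True if it exits with status 0."""
    done = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)
    return done.returncode == 0

def monitor(cmd, count, interval=1.0, probe_fn=probe,
            sleep_fn=time.sleep, now_fn=datetime.now):
    """Send `count` probes, `interval` seconds apart.

    Returns (loss_percent, lost), where `lost` holds the timestamp of
    every probe that failed, so losses can be matched against syslog.
    """
    if count <= 0:
        raise ValueError("count must be positive")
    lost = []
    for i in range(count):
        if not probe_fn(cmd):
            lost.append(now_fn())
        if i < count - 1:
            sleep_fn(interval)
    return 100.0 * len(lost) / count, lost

# Example (assumed Linux/BSD-style flags; adjust for your platform):
#   loss, lost = monitor(["ping", "-c", "1", "192.168.0.2"], count=3600)
#   print(f"{loss:.1f}% loss; failures at {[t.isoformat() for t in lost]}")
```

Since the messages in this thread recur roughly hourly, the run needs to cover at least a few hours to catch an event.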
Emil Velez
Honored Contributor

Re: HB connection

What is your heartbeat interval? You could be using the default HB interval of 1 second and a timeout of 2 seconds, which is very fast. Sometimes, when a heartbeat is missed, the cluster starts reforming, then a heartbeat comes in and it stops reforming.

Do you have a heartbeat on another LAN? Has the cluster reformed?

Also look for reforming messages.

Change the HB timeout to 5 seconds.

Good luck
gherbi
Occasional Advisor

Re: HB connection

Dear all,
There was a network problem; we have changed the two ports on the switch.

Thanks to all.