
one node reboot

 
SOLVED
Sanjiv Sharma_1
Honored Contributor

one node reboot

Hi,

I have a two node cluster.
2* N4000/HP-UX 11.00
B3935DA A.11.12 MC / Service Guard

Yesterday at about 19:45 one of the nodes rebooted, and all the packages running on it moved to the other node.

From the OLDsyslog.log it looks like there was some problem with Samba.

What is this error and what needs to be done?
OLDsyslog.log attached.
Everything is possible
7 REPLIES
Jeff Schussele
Honored Contributor

Re: one node reboot

Hi sanjiv,

Man that's UGLY.
I see that as a cascade failure.
First messages indicate timeouts hinting at network trouble.
Then the first set of errors shows that Samba couldn't open its DB file, which looks for all the world like a connection problem. That's reinforced by the inability to create network sockets. Then you seem to exhaust file locks - game over.
That's a classic "reboot or it ain't gonna recover" scenario - hence the system panicked.
I'd start by asking for network logs & system logs from the *other* end of those connections, because I see no errors for the local NIC. By that I mean this system could well have been the "victim" of severe trouble elsewhere - but of the sort where the NIC-to-switch link never dropped.
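To rule out the local side quickly, the standard 11.x tools are enough - a rough sketch (interface names are whatever lanscan reports on your box):

    lanscan          # list LAN interfaces and their hardware state
    netstat -in      # per-interface packet counts and in/out error counters
    netstat -s       # protocol-level stats (retransmits, drops, etc.)

If netstat -in shows clean counters on this system, that supports the "victim of trouble elsewhere" theory.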

My $0.02,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Sanjiv Sharma_1
Honored Contributor

Re: one node reboot

Hi Jeff,

Enclosed is the syslog.log of the 2nd node.
Everything is possible
Jeff Schussele
Honored Contributor

Re: one node reboot

Match up the times on those logs. I'm even more convinced that you had a BIG connection problem going on.
To *where* were these samba connections? I'd bet that system, or a network device in its subnet, lunched.
I strongly advise you also look at the Service Guard package logs on both systems for further clues. Usually located in /etc/cmcluster/pkg_name.
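For a quick look around the failure time, something like this on each node (control.sh.log is a common name for the control-script log - an assumption, so check what your package configs actually point at):

    ls -l /etc/cmcluster/*/control.sh.log              # find the package logs
    tail -100 /etc/cmcluster/pkg_name/control.sh.log   # entries around 19:45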

Rgds,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Rainer von Bongartz
Honored Contributor
Solution

Re: one node reboot

Change kernel param nflocks to

10*maxusers/2

Then you should not run into this situation
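A rough sketch of the change on 11.0 (the value 2048 is purely illustrative - use whatever the rule above gives you; nflocks is a static parameter, so a kernel rebuild and reboot are needed):

    kmtune -q maxusers        # see what maxusers is set to
    kmtune -q nflocks         # current setting
    kmtune -s nflocks=2048    # illustrative value only
    mk_kernel                 # rebuild the kernel from /stand/system
    kmupdate                  # install the new kernel at next boot
    shutdown -r -y 0          # reboot to pick it up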

Regards

Rainer
He's a real UNIX Man, sitting in his UNIX LAN making all his UNIX plans for nobody ...

Re: one node reboot

Here's the critical part of your syslog on the 2nd node:

Jun 4 19:30:24 ijmsia02 cmcld: Timed out node ijmsia01. It may have failed.
Jun 4 19:30:24 ijmsia02 cmcld: Attempting to form a new cluster
Jun 4 19:30:37 ijmsia02 nmbd[2331]: [2003/06/04 19:30:37, 0] nmbd/nmbd_become_lmb.c:(404)
Jun 4 19:30:37 ijmsia02 nmbd[2331]: *****
Jun 4 19:30:37 ijmsia02 nmbd[2331]:
Jun 4 19:30:37 ijmsia02 nmbd[2331]: Samba name server IJMSIAFS01 is now a local master browser for workgroup SGP.HP.COM on subnet 15.85.28.36
Jun 4 19:30:45 ijmsia02 cmcld: Obtaining Cluster Lock
Jun 4 19:30:46 ijmsia02 cmcld: Turning off safety time protection since the cluster

This is telling us that the second node was unable to communicate with the first via any of its heartbeat networks - therefore it didn't know the state of the first node, and a race for the cluster lock occurred. The second node won this race, so the first node was TOC'd.

As the others have indicated, you seem to have some kind of network issue - this may be in the network itself, or on either node. My advice would be to ensure you are bang up-to-date with all network related patches on both nodes, and see if the problem persists.
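A quick way to see which network patches (the PHNE_* ones) are on each box, so you can compare the two nodes:

    swlist -l product | grep PHNE    # installed network patches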

HTH

Duncan

I am an HPE Employee
Jarle Bjorgeengen
Trusted Contributor

Re: one node reboot

Do you have a separate heartbeat LAN? If not, consider buying an additional NIC for each node and making the heartbeats travel on the separate LAN too.

A short-term workaround may be to increase the heartbeat timeout and heartbeat interval in the cluster config.
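For reference, the timing parameters live in the cluster ASCII file and are in microseconds on this release - these particular numbers are examples, not recommendations:

    HEARTBEAT_INTERVAL    2000000    # 2 seconds
    NODE_TIMEOUT          8000000    # 8 seconds

    cmgetconf -c clustername cluster.ascii   # dump current config (if your SG release has cmgetconf)
    cmcheckconf -C cluster.ascii             # verify after editing
    cmapplyconf -C cluster.ascii             # apply to the cluster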

Rgds Jarle
Claus Nymann
New Member

Re: one node reboot

Hi Sanjiv,

Not exactly clear to me whether you got the answers you needed - but as I'm just crawling out from the smoking ruins of exactly the same experience (right down to the tdb errors about failing locks), I'm more than happy to share..!

Do follow the CIFS/9000 (the HP name for Samba) installation guide - and pay *SPECIAL* attention to the new kernel-parameter requirements for the newer versions! (You can find the guides that correspond to your version of "CIFS/9000"/Samba at http://www.docs.hp.com/hpux/netcom/index.html#CIFS/9000) The rule of thumb seems to be something like "10 times as many nflocks as users and 23 times as many nfiles as users".
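As a quick worked example of that rule (USERS=500 is an assumed figure - plug in your own count):

    USERS=500
    echo "nflocks should be >= `expr $USERS \* 10`"   # 5000
    echo "nfiles  should be >= `expr $USERS \* 23`"   # 11500
    kmtune -q nflocks    # compare against what the kernel actually has
    kmtune -q nfiles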

I'm sorry to say that our CIFS/9000 server failed miserably even though it was well inside these boundaries and had more than 35 file locks per user at the time of the crash - it is, however, catering to software developers, which could possibly translate into "LOTS of open files at any one point in time", and maybe the figures above (the factor-10 part) need to be adjusted according to the load type..?! (Jury's still out on that one! :-)

Br.
Claus