topic Re: one node reboot in Operating System - HP-UX

one node reboot

Sanjiv Sharma_1 — Thu, 05 Jun 2003 00:42:26 GMT

Hi,

I have a two node cluster.
2* N4000/HP-UX 11.00
B3935DA A.11.12 MC / Service Guard

Yesterday at about 19:45 one of the nodes rebooted, and all the packages running on it moved to the other node.

From OLDsyslog.log, it looks like there was some problem with Samba.

What is this error and what needs to be done?
OLDsyslog.log attached.

Re: one node reboot

Jeff Schussele — Thu, 05 Jun 2003 01:31:04 GMT

Hi Sanjiv,

Man that's UGLY.
I see that as a cascade failure.
First messages indicate timeouts hinting at network trouble.
Then the first set of errors shows that Samba couldn't open its DB file, which looks for all the world like a connection problem. That's reinforced by the inability to create network sockets. Then you seem to exhaust file locks - game over.
That's a classic "reboot or it ain't gonna recover" scenario - hence the system panicked.
I'd start by asking for network logs & system logs from the *other* end of those connections, because I see no errors for the local NIC. By that I mean this system could well have been the "victim" of severe trouble elsewhere, but of the sort where the NIC-to-switch link never dropped.

My $0.02,
Jeff

Re: one node reboot

Sanjiv Sharma_1 — Thu, 05 Jun 2003 01:44:48 GMT

Hi Jeff,

Enclosed is the syslog.log of the 2nd node.

Re: one node reboot

Jeff Schussele — Thu, 05 Jun 2003 01:57:30 GMT

Match up the times on those logs. I'm even more convinced that you had BIG connection problems going on.
To *where* were these Samba connections? I'd bet that system, or a network device in its subnet, lunched.
I strongly advise you to also look at the Service Guard package logs on both systems for further clues. They are usually located in /etc/cmcluster/pkg_name.
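
If it helps with matching up the times, here is a rough Python sketch that interleaves the syslogs from both nodes into one time-ordered list. The function names (parse_line, merge_logs) are made up for illustration, the year is an assumption because syslog lines carry none, and the sample lines are placeholders, not taken from the real logs.

```python
from datetime import datetime
import heapq


def parse_line(line, year=2003):
    """Return (timestamp, line) for a syslog line, or None if it has no timestamp.

    Syslog lines carry no year, so one is assumed (2003 here).
    """
    parts = line.split(None, 3)
    if len(parts) < 3:
        return None
    try:
        stamp = datetime.strptime(
            f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
        )
    except ValueError:
        return None  # continuation line or other text without a timestamp
    return stamp, line.rstrip("\n")


def merge_logs(*logs, year=2003):
    """Merge several iterables of syslog lines into one time-ordered list."""
    parsed = []
    for lines in logs:
        entries = [p for p in (parse_line(l, year) for l in lines) if p]
        parsed.append(sorted(entries, key=lambda e: e[0]))
    return [line for _, line in heapq.merge(*parsed, key=lambda e: e[0])]


# Placeholder lines for illustration only; they are NOT from the real logs.
node_a = ["Jun  4 19:29:50 nodeA demo: placeholder event on the first node\n"]
node_b = ["Jun  4 19:30:24 nodeB demo: placeholder event on the second node\n"]

for entry in merge_logs(node_a, node_b):
    print(entry)
```

With the real files you would pass the open file objects instead of the lists, e.g. merge_logs(open("OLDsyslog.log"), open("syslog.log")). Lines without a parsable timestamp are silently dropped, so treat the result as a reading aid and keep the originals at hand.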

Rgds,
Jeff

Re: one node reboot

Rainer von Bongartz — Thu, 05 Jun 2003 04:50:55 GMT

Change kernel param nflocks to

10*maxusers/2

Then you should not run into this situation

Regards

Rainer

Re: one node reboot

Duncan Edmonstone — Thu, 05 Jun 2003 07:57:41 GMT

Here's the critical part of your syslog on the 2nd node:

Jun 4 19:30:24 ijmsia02 cmcld: Timed out node ijmsia01. It may have failed.
Jun 4 19:30:24 ijmsia02 cmcld: Attempting to form a new cluster
Jun 4 19:30:37 ijmsia02 nmbd[2331]: [2003/06/04 19:30:37, 0] nmbd/nmbd_become_lmb.c:(404)
Jun 4 19:30:37 ijmsia02 nmbd[2331]: *****
Jun 4 19:30:37 ijmsia02 nmbd[2331]:
Jun 4 19:30:37 ijmsia02 nmbd[2331]: Samba name server IJMSIAFS01 is now a local master browser for workgroup SGP.HP.COM on subnet 15.85.28.36
Jun 4 19:30:45 ijmsia02 cmcld: Obtaining Cluster Lock
Jun 4 19:30:46 ijmsia02 cmcld: Turning off safety time protection since the cluster

This is telling us that the second node was unable to communicate with the first via any of its heartbeat networks - therefore it didn't know the state of the first node, and a race for the cluster lock occurred. The second node won this race, so the first node was TOC'd.
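
To see that sequence without the Samba noise in between, a small Python sketch like the following pulls only the cmcld lines out of a syslog. The regular expression and the function name daemon_events are my own illustration, assuming the usual "date host daemon[pid]: message" syslog layout; the sample lines are copied from the excerpt above.

```python
import re

# Matches "Mon D HH:MM:SS host daemon[pid]: message"; the [pid] part is optional.
SYSLOG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+"
    r"(?P<daemon>[^\s\[:]+)(?:\[\d+\])?:\s*(?P<msg>.*)$"
)


def daemon_events(lines, daemon="cmcld"):
    """Return (timestamp, host, message) tuples for one daemon, in file order."""
    events = []
    for line in lines:
        m = SYSLOG_RE.match(line.strip())
        if m and m.group("daemon") == daemon:
            events.append((m.group("stamp"), m.group("host"), m.group("msg")))
    return events


# Lines copied from the syslog excerpt quoted in this post.
excerpt = [
    "Jun 4 19:30:24 ijmsia02 cmcld: Timed out node ijmsia01. It may have failed.",
    "Jun 4 19:30:24 ijmsia02 cmcld: Attempting to form a new cluster",
    "Jun 4 19:30:37 ijmsia02 nmbd[2331]: *****",
    "Jun 4 19:30:45 ijmsia02 cmcld: Obtaining Cluster Lock",
    "Jun 4 19:30:46 ijmsia02 cmcld: Turning off safety time protection since the cluster",
]

for stamp, host, msg in daemon_events(excerpt):
    print(stamp, host, msg)
```

Run against the full syslog of each node, this gives the cluster daemon's view of events (timeout, cluster re-formation, cluster lock) in order, which is the part that explains the TOC.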

As the others have indicated, you seem to have some kind of network issue - this may be in the network itself, or on either node. My advice would be to ensure you are bang up-to-date with all network-related patches on both nodes, and see if the problem persists.

HTH

Duncan

Re: one node reboot

Jarle Bjorgeengen — Thu, 05 Jun 2003 08:29:41 GMT

Do you have a separate heartbeat LAN? If not, consider buying an additional NIC for each node, and make the HBs travel on the separate LAN too.

A short-term workaround may be to increase the heartbeat timeout and heartbeat interval in the cluster config.
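
As a rough illustration of that workaround, the Python sketch below rewrites two timing values in a copy of the cluster ASCII file. The parameter names HEARTBEAT_INTERVAL and NODE_TIMEOUT and their microsecond units are quoted from memory, so treat them as assumptions and check them against the file that cmgetconf produces for your cluster. The helper set_param, the sample fragment, and the new values are purely illustrative, not recommendations.

```python
import re


def set_param(config_text, name, value):
    """Replace the value on every uncommented 'NAME value' line.

    Raises KeyError if no such line exists, so a misspelled name is noticed.
    """
    pattern = re.compile(rf"^([ \t]*{re.escape(name)}[ \t]+)\S+", re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + str(value), config_text)
    if count == 0:
        raise KeyError(name)
    return new_text


# Hypothetical fragment of a cluster ASCII file (names and units are assumptions).
sample = """\
# NODE_TIMEOUT 2000000   (commented-out line, must stay untouched)
HEARTBEAT_INTERVAL 1000000
NODE_TIMEOUT 2000000
"""

# Illustrative values only: a longer interval and a more tolerant timeout.
updated = set_param(sample, "HEARTBEAT_INTERVAL", 2000000)
updated = set_param(updated, "NODE_TIMEOUT", 8000000)
print(updated)
```

The edited file would still have to be verified with cmcheckconf and applied with cmapplyconf. Keep in mind that a longer timeout also means the cluster takes longer to notice a node that really has failed.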

Rgds Jarle

Re: one node reboot

Claus Nymann — Wed, 23 Jul 2003 14:27:11 GMT

Hi Sanjiv,

Not exactly clear to me if you got the answers you needed - but as I'm just crawling out from the smoking ruins of exactly the same experience (down to the tdb nagging about failing locks) I'm more than happy to share..!

Do follow the CIFS/9000 (HP's name for Samba) installation guide - and pay *SPECIAL* attention to the newish kernel-parameter requirements for the newer versions! (You can find the guides that correspond to your version of "CIFS/9000"/Samba on http://www.docs.hp.com/hpux/netcom/index.html#CIFS/9000) The 'rule of thumb' seems to be something like "10 times as many 'nflocks' as users and 23 times as many 'nfiles' as users".

I'm sorry to say that our CIFS/9000 server failed miserably even though it was well inside these boundaries and had more than 35 file locks per user at the time of the crash - it is, however, catering to software developers, which could possibly translate into "LOTS of open files at any one point in time", and maybe the figures above (the factor 10 part) need to be adjusted according to the load type..?! (Jury's still out on that one! :-)
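
To put numbers on that rule of thumb, here is a tiny Python sketch. The function name and the user count are made up for illustration; the factors 10 and 23 are the ones quoted above and are left as arguments precisely because they may need raising for a heavier load type. The result covers only the CIFS share of the load, not whatever else the system needs.

```python
def cifs_kernel_minimums(users, locks_per_user=10, files_per_user=23):
    """Rule-of-thumb minimums for the CIFS load alone, per the factors quoted above."""
    return {
        "nflocks": users * locks_per_user,
        "nfiles": users * files_per_user,
    }


# Hypothetical user count, for illustration only.
print(cifs_kernel_minimums(200))
# -> {'nflocks': 2000, 'nfiles': 4600}

# A heavier load type might call for a larger lock factor, e.g. 40 per user.
print(cifs_kernel_minimums(200, locks_per_user=40))
# -> {'nflocks': 8000, 'nfiles': 4600}
```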

Br.
Claus