lan down at IP layer

 
caml0p3z
Occasional Contributor

Hello guys,

 

Recently I've been experiencing network problems in Serviceguard.  Syslog very frequently reports failures on lan0:

 

Nov 26 10:52:16 baket cmnetd[10430]: lan0 is down at the IP layer.
Nov 26 10:52:16 baket cmnetd[10430]: lan0 failed.
Nov 26 10:52:12 baket su: + tty?? root-oracle
Nov 26 10:52:16 baket above message repeats 291 times
Nov 26 10:52:16 baket cmnetd[10430]: lan0 switching to lan1
Nov 26 10:52:16 baket cmnetd[10430]: Subnet 10.12.1.0 switching from lan0 to lan1
Nov 26 10:52:16 baket cmnetd[10430]: Subnet 10.12.1.0 switched from lan0 to lan1
Nov 26 10:52:16 baket cmcld[10417]: Local switch has occurred since net_id 0x2 was not found on subnet 100.134.66.0.
Nov 26 10:52:16 baket su: + tty?? root-oracle
Nov 26 10:52:16 baket cmnetd[10430]: lan0 switched to lan1
Nov 26 10:52:16 baket cmcld[10417]: Local switch has occurred since net_id 0x2 was not found on subnet 100.134.66.0.
Nov 26 10:52:28 baket above message repeats 11 times
Nov 26 10:52:28 baket cmnetd[10430]: 10.12.1.137 recovered.

 

We have a two-node cluster of BL860 Integrity blades running HP-UX 11.31 and Serviceguard 11.20.  The second node is the one that constantly reports this network issue; the first node works perfectly.  For network connectivity we use VC Flex-10, so both nodes have the same network profile: 4 virtual LANs.

 

If I re-enable lan0 with cmmodnet, it works for a while, but a few days later the problem appears again.

 

Network_Parameters:
INTERFACE    STATUS                       PATH             NAME
PRIMARY      down (disabled) (IP only)    0/0/0/3/0/0/0    lan0
PRIMARY      up                           0/0/0/3/0/0/2    lan2
STANDBY      up                           0/0/0/3/0/0/1    lan1
STANDBY      up                           0/0/0/4/0/0/0    lan8
STANDBY      up                           0/0/0/4/0/0/1    lan9
STANDBY      up                           0/0/0/3/0/0/3    lan3

 

The two VC Flex-10 modules use two uplinks connected to a Cisco switch.  I can't understand why the second node keeps showing this issue.  Here is part of the cluster configuration:

 

NODE_NAME baket
NETWORK_INTERFACE lan0
STATIONARY_IP 10.12.1.137
NETWORK_INTERFACE lan2
HEARTBEAT_IP 192.100.1.20
NETWORK_INTERFACE lan1
NETWORK_INTERFACE lan8
NETWORK_INTERFACE lan9
NETWORK_INTERFACE lan3

 

NODE_NAME thoth
NETWORK_INTERFACE lan0
STATIONARY_IP 10.12.1.136
NETWORK_INTERFACE lan2
HEARTBEAT_IP 192.100.1.10
NETWORK_INTERFACE lan1
NETWORK_INTERFACE lan8
NETWORK_INTERFACE lan9
NETWORK_INTERFACE lan3

 

SUBNET 10.12.1.0
IP_MONITOR ON
POLLING_TARGET 10.12.1.31

SUBNET 192.100.1.0
IP_MONITOR OFF

 

I would like some help on this.  Thank you.

Ralf Seefeldt
Valued Contributor

Re: lan down at IP layer

Hi caml0p3z,

 

There are several possible reasons for this behaviour.

 

Let me start with the firmware: are both servers and both VC Flex-10 modules at the same firmware level?

Are both servers / Flex-10 modules / switches using the same fixed or negotiated LAN speed?

Are both servers at the same patch level?

Are both servers and both Flex-10 modules the same version?

 

Check whether there are known issues with the firmware and patches.

 

The problem is detected because you have IP_MONITOR switched ON for this subnet. Where is your polling target located? Is it in the same location as the server that does not show any problems?

Can you verify that the connectivity is really down (for example, by logging the results of a ping or something similar)?
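
A minimal sketch of such a ping log, assuming a Python 3.6+ interpreter is available on the node (not a given on HP-UX 11.31). The target 10.12.1.31 is the POLLING_TARGET from the posted configuration; the function names, log file name and interval are made up for illustration. The ping command line is passed in by the caller because the options differ between platforms, so check ping(1M) on the node for the right form.

```python
import subprocess
import time
from datetime import datetime

TARGET = "10.12.1.31"  # POLLING_TARGET from the posted cluster configuration


def probe(cmd):
    """Run one ping command and return True if it exits with status 0."""
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def format_line(when, target, ok):
    """Build one timestamped log line for a single probe result."""
    state = "reachable" if ok else "UNREACHABLE"
    return f"{when:%Y-%m-%d %H:%M:%S} {target} {state}"


def monitor(target, cmd, logfile, interval=5, rounds=None,
            check=probe, sleep=time.sleep):
    """Probe the target every `interval` seconds and append results to logfile.

    rounds=None loops until interrupted; check and sleep can be swapped
    out so the loop can be exercised without a network.
    """
    done = 0
    with open(logfile, "a") as fh:
        while rounds is None or done < rounds:
            fh.write(format_line(datetime.now(), target, check(cmd)) + "\n")
            fh.flush()  # keep the log current even if the script is killed
            done += 1
            sleep(interval)
```

It could be started with something like `monitor(TARGET, ["ping", "-c", "1", TARGET], "ping_target.log")`, where the `-c 1` form is the Linux-style option and only an assumption here. Comparing the UNREACHABLE timestamps with the cmnetd entries in syslog would show whether the polling target was really unreachable when lan0 was marked down.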

Are there firewalls, routers, ... between your server and the polling target?

 

Does the network recover by itself?
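
The syslog excerpt in the question already hints at the answer: cmnetd logs a "recovered" line a few seconds after the failure. A rough sketch for measuring this over a longer period is below. It assumes the message formats shown in the post and simply pairs each "is down at the IP layer" line with the next "recovered" line; the function names are illustrative, and on HP-UX the syslog is normally /var/adm/syslog/syslog.log.

```python
import re
from datetime import datetime

# Patterns taken from the cmnetd messages quoted in the post.
DOWN = re.compile(r"cmnetd\[\d+\]: (\S+) is down at the IP layer")
UP = re.compile(r"cmnetd\[\d+\]: \S+ recovered")


def parse_time(line):
    """Parse the 15-character syslog timestamp, e.g. 'Nov 26 10:52:16'.

    Syslog carries no year, so strptime falls back to 1900. That is
    good enough for measuring short gaps within one log.
    """
    return datetime.strptime(line[:15], "%b %d %H:%M:%S")


def outages(lines):
    """Return (interface, down_time, seconds_until_recovery) tuples."""
    found = []
    pending = None  # (interface, time) of a failure not yet recovered
    for line in lines:
        down = DOWN.search(line)
        if down:
            if pending is None:
                pending = (down.group(1), parse_time(line))
        elif UP.search(line) and pending is not None:
            iface, start = pending
            gap = (parse_time(line) - start).total_seconds()
            found.append((iface, start, gap))
            pending = None
    return found


sample = [
    "Nov 26 10:52:16 baket cmnetd[10430]: lan0 is down at the IP layer.",
    "Nov 26 10:52:16 baket cmnetd[10430]: lan0 failed.",
    "Nov 26 10:52:28 baket cmnetd[10430]: 10.12.1.137 recovered.",
]
for iface, start, gap in outages(sample):
    print(iface, start.strftime("%b %d %H:%M:%S"), gap)  # lan0 Nov 26 10:52:16 12.0
```

Run against the whole log, a list of short gaps would point towards brief disruptions or latency, while long or open-ended outages would point more towards a hard link or configuration problem.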

Did you ever have this issue with the network switched to lan1?

 

I would suspect a firmware or hardware problem, or some disruptions or latency in the LAN.

What are the LAN people saying?

 

Finally, there might be an issue with the configuration. I once had problems reaching the quorum server. It was available, but cmcheckcl reported otherwise. It turned out that a direct route was needed, and it had not been set up.

Check your routing and test a host route to the polling target.
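
Before adding a host route, it may help to confirm whether the polling target is on the same subnet as the stationary IP, since only an off-link target needs a gateway. A small sketch using Python's standard ipaddress module is below. The addresses come from the posted configuration, but the netmask is not shown in the post, so the /24 prefix here is purely an assumption; substitute the real value from the lan0 interface settings.

```python
import ipaddress


def on_link(local_ip, target_ip, prefix_len):
    """Return True if target_ip lies in the same subnet as local_ip."""
    # strict=False lets us pass a host address and derive its network.
    network = ipaddress.ip_network(f"{local_ip}/{prefix_len}", strict=False)
    return ipaddress.ip_address(target_ip) in network


# STATIONARY_IP of node baket and POLLING_TARGET from the post.
# The prefix length 24 is an assumption, not taken from the post.
print(on_link("10.12.1.137", "10.12.1.31", 24))  # True
# With a /25 mask the target would be off-link and need a route.
print(on_link("10.12.1.137", "10.12.1.31", 25))  # False
```

If the real mask puts the target off-link, the routing table and any router or firewall in between become the first things to look at.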

 

That's my first guess. I hope some of it helps to solve your problem.

 

Bye

Ralf