Operating System - HP-UX

LAN access fails every after 7 days

 
cam9269
Regular Advisor


Hi All,

We've been having a recurring issue with one of our rp7400s. LAN access fails every week, and we have to reboot the server to revive it.

I have the following NIC installed
LAN0 - 10/100 Mbps - telnet access
LAN7/8 - APA LAN900 - NFS link
LAN1/2 - FDDI - Backup link

Currently, LAN7 has issues with its link, so whenever we boot the machine APA complains that one link is down, and LAN8 takes over the APA group.

For the next 6 days, network access is fine. Then on the 7th day, we start to see packet loss in ping tests. Shortly afterwards, access to LAN0 and LAN900 is dead.

I've been playing with the idea that dead gateway detection (DGD) is involved, but I'm not really sure about it.

The thing is, when we lose the connection, LAN8 logs errors in nettl saying its link has gone down as well, which leaves LAN900 with nothing, so it fails completely.

So I was thinking, since DGD is activated on our server, it could very well have been triggered by LAN900 going down.

However, LAN0 is inaccessible only from other subnets; users in the same subnet can still access the server.

I need ideas to bounce around. I would really appreciate your input.

TIA!
Bill Hassell
Honored Contributor

Re: LAN access fails every after 7 days

I can't think of a good reason to ever enable DGD. All it does is silently disable network connections. Many network departments also turn off ICMP (ping) responses for security reasons, so a perfectly healthy gateway can be marked dead. It makes the sysadmin's job very painful.

To see all gateways you could use ip_ire_status

ndd -get /dev/ip ip_ire_status | grep -e IRE_GATEWAY -e flag

This results in a list of all gateways; the flags will indicate a dead gateway.


Check the current value:
ndd -get /dev/ip ip_ire_gw_probe


Disable Dead Gateway Detection:
ndd -set /dev/ip ip_ire_gw_probe 0


nddconf entry example:
TRANSPORT_NAME[3]=ip
NDD_NAME[3]=ip_ire_gw_probe
NDD_VALUE[3]=0


Bill Hassell, sysadmin
Bill Hassell
Honored Contributor

Re: LAN access fails every after 7 days

Sorry about the ugly formatting. Trying again:

To see all gateways you could use ip_ire_status

ndd -get /dev/ip ip_ire_status | grep -e IRE_GATEWAY -e flag

This results in a list of all gateways; the flags will indicate a dead gateway.


Check the current value:
ndd -get /dev/ip ip_ire_gw_probe


Disable Dead Gateway Detection:
ndd -set /dev/ip ip_ire_gw_probe 0


Bill Hassell, sysadmin
cam9269
Regular Advisor

Re: LAN access fails every after 7 days

Thanks Bill,

Command:
ndd -get /dev/ip ip_ire_status | grep -e IRE_GATEWAY -e flag

Has the following output:
IRE rfq stq addr mask src gateway mxfrg rtt ref type flag
000000004cb4f388 0000000000000000 0000000000000000 000.000.000.000 00000000 134.144.141.008 134.144.136.050 01500 00000 000 IRE_GATEWAY

No dead gateways here... And yes, DGD is enabled for this machine. We'll be disabling it from shell and from the nddconf
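
For checking this output repeatedly (for example, once the packet loss starts), here is a minimal Python sketch that pairs the header with each row and lists gateways whose flag column is non-empty. It assumes the output is whitespace-separated, that "flag" is the last column, and that it is simply blank when no flag is set, as in the row above. The function names are made up for illustration, and the script can run on any workstation against saved output.

```python
# Output copied from the ndd/grep command above (header line plus one row).
SAMPLE_OUTPUT = (
    "IRE rfq stq addr mask src gateway mxfrg rtt ref type flag\n"
    "000000004cb4f388 0000000000000000 0000000000000000 000.000.000.000 "
    "00000000 134.144.141.008 134.144.136.050 01500 00000 000 IRE_GATEWAY"
)


def parse_ire_rows(text):
    """Turn the header line and the rows below it into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    columns = lines[0].split()
    rows = []
    for line in lines[1:]:
        values = line.split()
        # ASSUMPTION: a short row is only missing its trailing flag value.
        values += [""] * (len(columns) - len(values))
        rows.append(dict(zip(columns, values)))
    return rows


def flagged_gateways(rows):
    """Return the gateway address of every IRE_GATEWAY row with a flag set."""
    return [
        row["gateway"]
        for row in rows
        if row["type"] == "IRE_GATEWAY" and row["flag"]
    ]


rows = parse_ire_rows(SAMPLE_OUTPUT)
print("gateways with a flag set:", flagged_gateways(rows))
```

An empty list here only says that no flag was present at the moment the output was captured; it does not interpret what any particular flag value means.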

Is it possible that traffic from LAN8 (LAN900) is 'transferred' to LAN0 when LAN8 dies? Currently, we're experiencing intermittent LAN8 connections as seen from syslog.log:

APA/LM: FOG:lan900 - lan8 is down
APA/LM: FOG:lan900 is down
APA/LM: FOG:lan900 - lan8 is up (lan8 is active)
APA/LM: FOG:lan900 is up (lan8 is active)
APA/LM: FOG:lan900 - lan8 is down
APA/LM: FOG:lan900 is down
APA/LM: FOG:lan900 - lan8 is up (lan8 is active)
APA/LM: FOG:lan900 is up (lan8 is active)

When this happens, ping statistics to other servers are the first to be affected, then NFS connections start failing, then access to LAN0 dies. Ideas?
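
To put a number on how often the link flaps, the APA/LM messages can be counted from a copy of syslog.log. Below is a rough Python sketch, assuming the messages always follow the two shapes shown above; the regular expression and the function name are illustrative, not part of any HP tool.

```python
import re
from collections import Counter

# Matches APA link-monitor messages such as:
#   APA/LM: FOG:lan900 - lan8 is down
#   APA/LM: FOG:lan900 is up (lan8 is active)
APA_LINE = re.compile(
    r"APA/LM: FOG:(?P<group>lan\d+)(?: - (?P<port>lan\d+))? is (?P<state>up|down)"
)

# The eight lines quoted from syslog.log above.
SAMPLE = [
    "APA/LM: FOG:lan900 - lan8 is down",
    "APA/LM: FOG:lan900 is down",
    "APA/LM: FOG:lan900 - lan8 is up (lan8 is active)",
    "APA/LM: FOG:lan900 is up (lan8 is active)",
    "APA/LM: FOG:lan900 - lan8 is down",
    "APA/LM: FOG:lan900 is down",
    "APA/LM: FOG:lan900 - lan8 is up (lan8 is active)",
    "APA/LM: FOG:lan900 is up (lan8 is active)",
]


def count_transitions(lines):
    """Count up/down events per interface (member port or the aggregate)."""
    counts = Counter()
    for line in lines:
        match = APA_LINE.search(line)
        if match is None:
            continue  # not an APA link-state message
        interface = match.group("port") or match.group("group")
        counts[(interface, match.group("state"))] += 1
    return counts


for (interface, state), total in sorted(count_transitions(SAMPLE).items()):
    print(f"{interface} went {state} {total} time(s)")
```

Because the pattern is searched anywhere in the line, the usual syslog timestamp and hostname prefix should not get in the way. Running it over each day's log would show whether the flapping really starts on the 7th day or builds up earlier.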
Bill Hassell
Honored Contributor

Re: LAN access fails every after 7 days

APA can be configured several different ways so if LAN0 and LAN8 are in the same APA redundancy group, then it should take over. However, it sounds like you have a very nasty network failure -- it is not failing hard but goes very intermittent. When you start having problems, I would get a network analyzer running (Wireshark) and see what is happening. I would never run NFS over a flaky network -- it will hang your entire system.

> APA/LM: FOG:lan900 is down
> APA/LM: FOG:lan900 - lan8 is up (lan8 is active)

These look very ominous. I would not enable LAN8 until the network problems are resolved.


Bill Hassell, sysadmin
cam9269
Regular Advisor

Re: LAN access fails every after 7 days

Thanks Bill, actually LAN7 and LAN8 are supposedly partners for APA. But LAN7 has been disconnected since 2008, for reasons nobody can recall at the moment. So, LAN8 is what's left from the group. So, I guess my theory that LAN8's traffic is routed to LAN0 (it having the default gateway) is not at all valid then?

Is wireshark the only way to go? I'm not so sure if I can get this installed on this machine though.
Matti_Kurkela
Honored Contributor

Re: LAN access fails every after 7 days

Installing Wireshark on your production system requires a load of dependencies. Fortunately there is a way to avoid that.

You can use tcpdump (which has much simpler requirements) or even HP-UX's native tools to create a network trace on the system that has the problem, then move the trace file to another host (e.g. your personal workstation) and use wireshark on it to analyze the stored trace. Wireshark can read most common network trace file formats.

Taking a network trace on HP-UX with no extra software installed:
http://www.compute-aid.com/nettl.html

MK
Bill Hassell
Honored Contributor

Re: LAN access fails every after 7 days

> LAN7 has been disconnected since 2008

I would drop the APA configuration completely. That's a lot of complicated software that is essentially doing nothing (I assume that LAN7 and LAN8 are the only members).

As far as Wireshark goes, I would install it on a laptop (it is much easier and simpler to set up) rather than on HP-UX. Then you can use tcpdump or even nettl to perform traces. Wireshark reads the output of virtually every packet trace program there is.

What do the logs show in your routers? I would turn on stats for the problem ports and get the network administrators tracing the problem. In general, a LAN down message means that fundamental network connectivity has been dropped. You may have a bad switch or router that is causing the issue. Try a different port on the switch. Is there a forklift truck running over the LAN cables every week?


Bill Hassell, sysadmin
cam9269
Regular Advisor

Re: LAN access fails every after 7 days

Thanks guys... What we now did was to connect both NICs on the Giga-switches. Some dropouts are being seen in the syslog file. So, currently, the network guys are looking at this.

Just an additional question about the server's routing table. I am confused about the routes pointing to the LAN900 interface; there are several of them. Can you guys explain how this works?

Thanks!

Here's what we have:

Routing tables
Destination Gateway Flags Refs Interface Pmtu
127.0.0.1 127.0.0.1 UH 0 lo0 4136
134.144.188.38 134.144.188.38 UH 0 lan2 4136
134.144.141.8 134.144.141.8 UH 0 lan0 4136
134.144.188.166 134.144.188.166 UH 0 lan900 4136
134.144.141.20 134.144.188.165 UGH 0 lan900 1500
134.144.141.17 134.144.188.162 UGH 0 lan900 1500
134.144.141.16 134.144.188.161 UGH 0 lan900 1500
134.144.141.19 134.144.188.164 UGH 0 lan900 1500
134.144.141.18 134.144.188.163 UGH 0 lan900 1500
134.144.141.12 134.144.188.34 UGH 0 lan2 4352
134.144.141.11 134.144.188.33 UGH 0 lan2 4352
134.144.188.32 134.144.188.38 U 2 lan2 4352
134.144.188.160 134.144.188.166 U 2 lan900 1500
134.144.128.0 134.144.141.8 U 2 lan0 1500
127.0.0.0 127.0.0.1 U 0 lo0 4136
default 134.144.136.50 UG 0 lan0 1500
cam9269
Regular Advisor

Re: LAN access fails every after 7 days

Here's a better view:

Routing tables
Destination      Gateway          Flags  Refs  Interface  Pmtu
127.0.0.1        127.0.0.1        UH     0     lo0        4136
134.144.188.38   134.144.188.38   UH     0     lan2       4136
134.144.141.8    134.144.141.8    UH     0     lan0       4136
134.144.188.166  134.144.188.166  UH     0     lan900     4136
134.144.141.20   134.144.188.165  UGH    0     lan900     1500
134.144.141.17   134.144.188.162  UGH    0     lan900     1500
134.144.141.16   134.144.188.161  UGH    0     lan900     1500
134.144.141.19   134.144.188.164  UGH    0     lan900     1500
134.144.141.18   134.144.188.163  UGH    0     lan900     1500
134.144.141.12   134.144.188.34   UGH    0     lan2       4352
134.144.141.11   134.144.188.33   UGH    0     lan2       4352
134.144.188.32   134.144.188.38   U      2     lan2       4352
134.144.188.160  134.144.188.166  U      2     lan900     1500
134.144.128.0    134.144.141.8    U      2     lan0       1500
127.0.0.0        127.0.0.1        U      0     lo0        4136
default          134.144.136.50   UG     0     lan0       1500
Matti_Kurkela
Honored Contributor

Re: LAN access fails every after 7 days

Seeing the netmasks associated with the route entries could be important in understanding the routes.

Try "netstat -rnv" and post the output here.
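
To illustrate why the masks matter, here is a simplified Python model of a routing lookup in which the most specific matching route wins (longest prefix match). Host routes (flag H) are /32 by definition. The masks on the three network routes are guesses made purely for illustration, since the real ones are not shown yet, and only a few of the posted routes are included. The function name is hypothetical.

```python
import ipaddress

# (destination, gateway, interface), taken from the routing table posted above.
# ASSUMPTION: the /29 and /20 masks are guesses; "netstat -rnv" would show
# the real values.
ROUTES = [
    ("134.144.141.20/32", "134.144.188.165", "lan900"),   # UGH host route
    ("134.144.141.12/32", "134.144.188.34", "lan2"),      # UGH host route
    ("134.144.188.160/29", "134.144.188.166", "lan900"),  # assumed mask
    ("134.144.188.32/29", "134.144.188.38", "lan2"),      # assumed mask
    ("134.144.128.0/20", "134.144.141.8", "lan0"),        # assumed mask
    ("0.0.0.0/0", "134.144.136.50", "lan0"),              # default route
]


def pick_route(destination):
    """Return (gateway, interface) of the most specific matching route."""
    address = ipaddress.ip_address(destination)
    candidates = []
    for network_text, gateway, interface in ROUTES:
        network = ipaddress.ip_network(network_text)
        if address in network:
            candidates.append((network.prefixlen, gateway, interface))
    # The default route matches everything, so candidates is never empty.
    _, gateway, interface = max(candidates)
    return gateway, interface


for target in ("134.144.141.20", "134.144.141.21", "10.1.2.3"):
    gateway, interface = pick_route(target)
    print(f"{target} -> via {gateway} on {interface}")
```

Under these assumed masks, 134.144.141.20 leaves through lan900 because of its host route, even though its neighbour 134.144.141.21 would use lan0. That is one way to read the UGH entries: traffic to those specific hosts is deliberately steered over the lan900 link instead of the telnet LAN.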

MK