Re: HP-UX 11.31 Randomly being unable to reach some servers.

 
michael321123
Visitor

HP-UX 11.31 Randomly being unable to reach some servers.

The server seems to randomly be unable to reach some servers (ping/ssh) by IP or name and requires /sbin/init.d/net start.  DNS servers seem to be unreachable the most, even though it's just this one server that cannot reach them.

Can someone please help me track down what is causing this issue?

Steven Schweda
Honored Contributor

Re: HP-UX 11.31 Randomly being unable to reach some servers.

> The server seems to randomly be unable to reach some servers(ping/ssh)
> by IP or name

   As usual, showing actual commands with their actual output can be
more helpful than vague descriptions or interpretations.

> and requires /sbin/init.d/net start.

   What does that mean?

> DNS servers seem to be unreachable the most even though its just this
> one server that cannot reach them.

   Ok.  One server has some kind of network problem.  As above, I don't
know what "seem to be unreachable" means to you.  What, exactly, are you
doing, and what, exactly, happens when you do it?

> Can someone please help me track down what is causing this issue?

   It might be easier if you provided more actual evidence, and less
analysis.

   Some part of the network hardware (in or outside "The server") could
be bad.  If the hardware is all good, then the problem is likely in the
software (including the network configuration).

   One way to get unreliable operation is to have two devices with the
same IP address.  In that case, one system can send a message, and the
reply can go to a different system (which has the same IP address).

michael321123
Visitor

Re: HP-UX 11.31 Randomly being unable to reach some servers.

Sometimes it will be fine for weeks or a month and other times it will have a problem multiple times a week.

Example of networking issue:

I am not able to ping 10.1.10.7 and 10.1.10.14 but can ping 10.1.10.2. Other servers can ping all 3 servers without any packet loss. After restarting the network service by running "/sbin/init.d/net start" I am again able to ping all 3 servers. This networking issue doesn't just affect these 3 servers; the server seems to randomly become unable to reach other servers as well.

 

[/home/root]# ping 10.1.10.2
PING 10.1.10.2: 64 byte packets
64 bytes from 10.1.10.2: icmp_seq=0. time=0. ms
64 bytes from 10.1.10.2: icmp_seq=1. time=0. ms

----10.1.10.2 PING Statistics----
2 packets transmitted, 2 packets received, 0% packet loss
round-trip (ms)  min/avg/max = 0/0/0
[/home/root]# ping 10.1.10.14
PING 10.1.10.14: 64 byte packets

----10.1.10.14 PING Statistics----
5 packets transmitted, 0 packets received, 100% packet loss
[/home/root]# ping 10.1.10.7
PING 10.1.10.7: 64 byte packets

----10.1.10.7 PING Statistics----
4 packets transmitted, 0 packets received, 100% packet loss

 

[/home/root]# /sbin/init.d/net start
----10.1.10.14 PING Statistics----
4 packets transmitted, 4 packets received, 0% packet loss
round-trip (ms)  min/avg/max = 0/0/0
[/home/root]# ping 10.1.10.7
PING 10.1.10.7: 64 byte packets
64 bytes from 10.1.10.7: icmp_seq=0. time=0. ms
64 bytes from 10.1.10.7: icmp_seq=1. time=0. ms

----10.1.10.7 PING Statistics----
2 packets transmitted, 2 packets received, 0% packet loss
round-trip (ms)  min/avg/max = 0/0/0
[/home/root]# ping 10.1.10.2
PING 10.1.10.2: 64 byte packets
64 bytes from 10.1.10.2: icmp_seq=0. time=0. ms
64 bytes from 10.1.10.2: icmp_seq=1. time=0. ms

----10.1.10.2 PING Statistics----
2 packets transmitted, 2 packets received, 0% packet loss
round-trip (ms)  min/avg/max = 0/0/0

 

> Ok.  One server has some kind of network problem.  As above, I don't
> know what "seem to be unreachable" means to you.  What, exactly, are you
> doing, and what, exactly, happens when you do it?

When I said "seems to be unreachable", I meant that I can still ping other servers when I notice that I am unable to ping a few servers (shown above). Other servers can ping the unreachable servers. So although they should be reachable/pingable, I am unable to reach them, and restarting the network service by running /sbin/init.d/net start immediately resolves the issue.
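
A minimal sketch of how ping results like the ones above could be classified automatically (not part of the original post). The function names `ping_loss` and `classify` are made up for illustration, and the sample input is the statistics output shown earlier in this post; it assumes the "N% packet loss" wording seen there.

```shell
#!/bin/sh
# Read ping output on stdin and print the packet-loss percentage taken
# from the statistics line ("... 100% packet loss" prints 100).
ping_loss() {
    sed -n 's/.* \([0-9][0-9]*\)% packet loss.*/\1/p'
}

# Read ping output on stdin and print OK only when the loss is exactly 0%.
# Missing statistics (no output at all) also count as FAIL.
classify() {
    loss=$(ping_loss)
    if [ "$loss" = "0" ]; then
        echo OK
    else
        echo FAIL
    fi
}

# Demo using output copied from the failing ping to 10.1.10.14.
classify <<'EOF'
PING 10.1.10.14: 64 byte packets

----10.1.10.14 PING Statistics----
5 packets transmitted, 0 packets received, 100% packet loss
EOF
```

Feeding the live output of a ping with a fixed packet count into `classify` from a periodic job would give a timestampable record of when each target stops answering.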

Below is the output of some config files and the network status. Please let me know if there is anything else that might be helpful.

 

[/home/root]# cat /etc/rc.config.d/netconf

HOSTNAME="server1"
OPERATING_SYSTEM=HP-UX
LOOPBACK_ADDRESS=127.0.0.1

INTERFACE_NAME[0]="lan0"
IP_ADDRESS[0]="10.1.11.180"
SUBNET_MASK[0]="255.255.0.0"
BROADCAST_ADDRESS[0]=""
INTERFACE_STATE[0]=""
DHCP_ENABLE[0]="0"
INTERFACE_MODULES[0]=""


ROUTE_DESTINATION[0]="default"
ROUTE_MASK[0]=""
ROUTE_GATEWAY[0]="10.1.11.254"
ROUTE_COUNT[0]="1"
ROUTE_ARGS[0]=""
ROUTE_SOURCE[0]=""

GATED=0
GATED_ARGS=""

RDPD=0


RARPD=0

DEFAULT_INTERFACE_MODULES=""

 

[/home/root]# cat /etc/resolv.conf
domain abc.123.net
nameserver 10.1.10.2
nameserver 10.1.10.14
nameserver 10.1.10.7

 

[/home/root]# netstat -rn
Routing tables
Destination           Gateway            Flags Refs Interface  Pmtu
127.0.0.1             127.0.0.1          UH    0    lo0       32808
10.1.11.180           10.1.11.180        UH    0    lan0      32808
10.1.0.0              10.1.11.180        U     2    lan0       1500
127.0.0.0             127.0.0.1          U     0    lo0       32808
default               10.1.11.254        UG    0    lan0       1500

 

[/home/root]# cat /etc/nsswitch.conf
hosts: files[NOTFOUND=continue UNAVAIL=continue] dns
ipnodes: files[NOTFOUND=continue UNAVAIL=continue TRYAGAIN=return] dns

 

                      LAN INTERFACE STATUS DISPLAY
                       Mon, Oct 10,2016  10:05:12

PPA Number                      = 0
Description                     = lan0 HP PCI-X 1000Base-T Release B.11.31.1009
Type (value)                    = ethernet-csmacd(6)
MTU Size                        = 1500
Speed                           = 1000000000
Station Address                 = 0x16353ebd20
Administration Status (value)   = up(1)
Operation Status (value)        = up(1)
Last Change                     = 191
Inbound Octets                  = 611708276
Inbound Unicast Packets         = 116144
Inbound Non-Unicast Packets     = 8145237
Inbound Discards                = 0
Inbound Errors                  = 0
Inbound Unknown Protocols       = 1600
Outbound Octets                 = 80814480
Outbound Unicast Packets        = 194564
Outbound Non-Unicast Packets    = 270
Outbound Discards               = 0
Outbound Errors                 = 0
Outbound Queue Length           = 0
Specific                        = 655367

Press <Return> to continue


Ethernet-like Statistics Group

Index                           = 1
Alignment Errors                = 0
FCS Errors                      = 0
Single Collision Frames         = 0
Multiple Collision Frames       = 0
Deferred Transmissions          = 0
Late Collisions                 = 0
Excessive Collisions            = 0
Internal MAC Transmit Errors    = 0
Carrier Sense Errors            = 0
Frames Too Long                 = 0
Internal MAC Receive Errors     = 0

 

> Some part of the network hardware (in or outside "The server") could
> be bad.  If the hardware is all good, then the problem is likely in the
> software (including the network configuration).

I don't have access to anything outside the servers. There is nothing in dmesg or /var/adm/syslog/syslog.log relating to any network issues other than the following, which I am unfamiliar with.

WARNING: hpsol_strioctl(): TI_GETPEERNAME failed, T_ADDR_REQ fail error = ENOTCONN.

Steven Schweda
Honored Contributor

Re: HP-UX 11.31 Randomly being unable to reach some servers.

   As I read this:

> [...]
> IP_ADDRESS[0]="10.1.11.180"
> SUBNET_MASK[0]="255.255.0.0"
> [...]
> ROUTE_DESTINATION[0]="default"
> ROUTE_MASK[0]=""
> ROUTE_GATEWAY[0]="10.1.11.254"
> [...]

> ping 10.1.10.2
> [...]
> ping 10.1.10.14
> [...]
> ping 10.1.10.7
> [...]

your system is on the 10.1.11.* subnet, and the systems with unreliable
communication are on the 10.1.10.* subnet.  And the router on the
10.1.11.* subnet is at 10.1.11.254.

   What is the router at 10.1.11.254?

> Other servers can ping all 3 servers [...]

   On which subnet are these "Other servers"?

> I dont have access to anything outside the servers.

   Network switches?  Routers?  Cables?

> [...] There is nothing in dmesg or /var/adm/syslog/syslog.log relating
> to any network issues [...]

   On which system?  All of them?

   What are the 10.1.10.* systems using to reach your 10.1.11.* subnet?
(Route(s)? Router(s)?)

   For "ping" to work, your system must get a message to another system,
and that other system must get its reply back to your system.  If your
system is configured properly, then the message may get to the other
system.  But, if the return route is bad on the other system, then you
may not get the reply, even when your system is configured properly.
You need to look at the systems on both ends.

michael321123
Visitor

Re: HP-UX 11.31 Randomly being unable to reach some servers.

> your system is on the 10.1.11.* subnet, and the systems with unreliable
> communication are on the 10.1.10.* subnet.  And the router on the
> 10.1.11.* subnet is at 10.1.11.254.

> What is the router at 10.1.11.254?

The subnet is 10.1.*.*  The router is the router for the whole subnet.


> On which subnet are these "Other servers"?
They're all on 10.1.0.0/16.

> Network switches?  Routers?  Cables?
I have access to none of these, and since restarting the network service temporarily clears the issue, I think it's internal to the server.

> On which system?  All of them?
Yes, on all the servers I checked there is nothing showing up in the 2 logs about the network problem.

> What are the 10.1.10.* systems using to reach your 10.1.11.* subnet?
> (Route(s)? Router(s)?)
They're on the same subnet, so no routes or routers are needed.

Bill Hassell
Honored Contributor

Re: HP-UX 11.31 Randomly being unable to reach some servers.

You may be seeing dead gateway detection.

This is an old setting that should never have been turned on by default. When set, the network code regularly pings routers to see if they are alive (even though ping is a primitive and almost useless test). When the router fails to respond, the network code assumes that the router is dead and stops using that route (an even more useless action). It is not unusual for your network security team to disable ICMP response (i.e., ping), but with this setting in HP-UX, all routed traffic is halted because of a missed ping.

You need to set the dead gateway detect to off on *every* HP-UX server you have.

To make the change permanent, edit the nddconf file in /etc/rc.config.d and add this:

TRANSPORT_NAME[0]=ip
NDD_NAME[0]=ip_ire_gw_probe
NDD_VALUE[0]=0

The above assumes that there are no [0] entries already. If there are, use the next available array reference such as [1] or [2].

Then run: ndd -c

which reads the file and performs the settings. This sets the value to 0 and validates that the nddconf file is of the proper format.
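
A small sketch for finding that next available array index (an illustration, not part of the original reply). `next_ndd_index` and `print_stanza` are hypothetical helper names, and the sample file contents are invented; the sketch only prints the stanza and does not modify /etc/rc.config.d/nddconf.

```shell
#!/bin/sh
# Print the next unused array index in an nddconf-style file.
# Only uncommented TRANSPORT_NAME[n]= lines are considered.
next_ndd_index() {
    sed -n 's/^[[:space:]]*TRANSPORT_NAME\[\([0-9][0-9]*\)\].*/\1/p' "$1" |
        awk 'BEGIN { n = 0 } $1 + 0 >= n { n = $1 + 1 } END { print n }'
}

# Print the three lines from the post, using the next free index.
print_stanza() {
    i=$(next_ndd_index "$1")
    printf 'TRANSPORT_NAME[%s]=ip\n' "$i"
    printf 'NDD_NAME[%s]=ip_ire_gw_probe\n' "$i"
    printf 'NDD_VALUE[%s]=0\n' "$i"
}

# Demo against a throwaway sample file (hypothetical existing entry).
sample=/tmp/nddconf.sample.$$
cat > "$sample" <<'EOF'
# TRANSPORT_NAME[5]=commented out, ignored
TRANSPORT_NAME[0]=tcp
NDD_NAME[0]=tcp_conn_request_max
NDD_VALUE[0]=4096
EOF
print_stanza "$sample"    # uses index 1, since [0] is taken
rm -f "$sample"
```

Pointing `print_stanza` at the real nddconf file would print the stanza with the index to use; adding it to the file is left as a manual step.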

(Did I mention that *every* HP-UX server needs this fix?) As Steven mentions, ping involves two systems as well as the pathway in both directions. Your network stats don't show any signal quality issues so you may also want to check your router stats as well as the remote system networking. Also check for any duplicate IP addresses on both subnets.



Bill Hassell, sysadmin
michael321123
Visitor

Re: HP-UX 11.31 Randomly being unable to reach some servers.

> You may be seeing dead gateway detection.

The servers I have listed are all on the same subnet, so dead gateway detection is not the issue; besides, it's not stopping access to all servers.

traceroute 10.1.10.2
traceroute to 10.1.10.2 (10.1.10.2), 30 hops max, 40 byte packets
 1  server2.amhc.amhealthways.net (10.1.10.2)  0.207 ms *  0.135 ms

Steven Schweda
Honored Contributor

Re: HP-UX 11.31 Randomly being unable to reach some servers.

> As I read this:
> [...]
> > SUBNET_MASK[0]="255.255.0.0"
> [...]
> your system is on the 10.1.11.* subnet, and the systems with unreliable
> communication are on the 10.1.10.* subnet. [...]

> The subnet is 10.1.*.* [...]

   Sorry.  Apparently my brain stopped working long ago.  I thought that
I saw three 255's, not two.  (Duh.)

   Of course, you've shown the parameters for only one of the systems,
so we must trust that all the other systems' interfaces have the same
(/16) netmask.
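
   As an illustration of the netmask arithmetic (a sketch added for clarity, not from the original reply): with a 255.255.0.0 mask both 10.1.11.180 and 10.1.10.2 reduce to the network 10.1.0.0, while a 255.255.255.0 mask would put them on different networks. The helper names `ip_to_int`, `network_of`, and `same_subnet` are made up for this example.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Print the network number (as an integer) for an address and a netmask.
network_of() {
    echo $(( $(ip_to_int "$1") & $(ip_to_int "$2") ))
}

# Exit status 0 when both addresses are in the same network under the mask.
# Arguments: address1 address2 netmask
same_subnet() {
    [ "$(network_of "$1" "$3")" -eq "$(network_of "$2" "$3")" ]
}

# Addresses and masks taken from the thread.
for mask in 255.255.0.0 255.255.255.0; do
    if same_subnet 10.1.11.180 10.1.10.2 "$mask"; then
        echo "$mask: same subnet, traffic is delivered directly"
    else
        echo "$mask: different subnets, traffic goes via the default gateway"
    fi
done
```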

   It would be interesting to see some tests run in the opposite
direction (ping, traceroute, ...) when you see the problem.  Otherwise,
I may be out of ideas.  If the system at 10.1.11.180 is the only one
which has this problem, then my best guess would still be a duplicate IP
address somewhere.  (Do you have a DHCP server somewhere which is
handing out 10.1.11.180 to some other system, for example?  But I would
expect someone to log an error if this happened.)

michael321123
Visitor

Re: HP-UX 11.31 Randomly being unable to reach some servers.

> Of course, you've shown the parameters for only one of the systems,
> so we must trust that all the other systems' interfaces have the same
> (/16) netmask.

They are indeed all /16

> It would be interesting to see some tests run in the opposite
> direction (ping, traceroute, ...) when you see the problem.  Otherwise,
> I may be out of ideas.  If the system at 10.1.11.180 is the only one
> which has this problem, then my best guess would still be a duplicate IP
> address somewhere.  (Do you have a DHCP server somewhere which is
> handing out 10.1.11.180 to some other system, for example?  But I would
> expect someone to log an error if this happened.)

I don't have access to the 3 DNS servers I have listed, so I'll have to wait until it's unable to reach another server that I do have access to.
The IPs should all be static, but it is possible this IP was used twice, although I would expect another server to be having problems.

Also, if it is a duplicate IP problem, why would restarting the network service have any effect?

Steven Schweda
Honored Contributor

Re: HP-UX 11.31 Randomly being unable to reach some servers.

> Also if it is a duplicate IP problem why would restarting the network
> service have any [e]ffect?

   I don't know enough about the low-level details of IP networking to
say anything with any confidence, but I can imagine that activating an
interface might cause ARP data for that interface to be broadcast, which
might cause other systems to associate the newly activated interface
with the duplicate address instead of whoever else has it.  But I'm only
guessing, so no bets.
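
   One way to test that guess, sketched under assumptions (this is an illustration, not something from the thread): collect ARP table listings on a peer machine while things work and again while they fail, then look for an IP address that shows up with more than one hardware address. The function name `arp_conflicts` is made up, the input is assumed to be lines of the form "name (ip) at mac ...", and the second hardware address in the demo is invented.

```shell
#!/bin/sh
# Read one or more concatenated ARP listings on stdin and print every
# IP address that appears with more than one distinct hardware address.
arp_conflicts() {
    awk '$3 == "at" {
             ip = $2
             gsub(/[()]/, "", ip)              # strip the parentheses
             key = ip SUBSEP $4
             if (!(key in seen)) {             # first time for this ip/mac pair
                 seen[key] = 1
                 count[ip]++
                 macs[ip] = macs[ip] " " $4
             }
         }
         END {
             for (ip in count)
                 if (count[ip] > 1)
                     print ip ":" macs[ip]
         }'
}

# Demo: two snapshots in which 10.1.11.180 changes hardware address.
arp_conflicts <<'EOF'
server1 (10.1.11.180) at 0:16:35:3e:bd:20 ether
server2 (10.1.10.2) at 0:1:2:3:4:5 ether
server1 (10.1.11.180) at 0:aa:bb:cc:dd:ee ether
server2 (10.1.10.2) at 0:1:2:3:4:5 ether
EOF
```

   If an address is reported here and one of the listed hardware addresses does not match the Station Address of the real interface, that would point toward a duplicate IP address.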