Operating System - HP-UX

traceroute hangs from one specific server

 
Tom Geudens
Honored Contributor


Hi,
I have the following problem :
From server sv00127
#traceroute xw218
traceroute to xw218.dolmen.be (10.118.112.18), 30 hops max, 20 byte packets
1 ro11-fe400-101.dolmen.be (10.101.1.25) 1 ms 1 ms 1 ms
2 pc3926.dolmen.be (10.102.2.1) 1 ms 1 ms 1 ms
3 rodca.dolmen.be (192.168.30.1) 30 ms 3 ms 2 ms
4 xw218.dolmen.be (10.118.112.18) 3 ms

... hangs, does not return to commandline

From server sv00128
#traceroute xw218
traceroute to xw218.dolmen.be(10.118.112.18), 30 hops max, 20 byte packets
1 ro11-fe400-101.dolmen.be (10.101.1.25) 3 ms 1 ms 1 ms
2 pc3926.dolmen.be (10.102.2.1) 1 ms 1 ms 1 ms
3 rodca.dolmen.be (192.168.30.1) 2 ms 3 ms 2 ms
4 xw218.dolmen.be (10.118.112.18) 3 ms * 11 ms

... times out and returns to commandline, as it should be

As you can see, traceroute hangs on server sv00127 but works from server sv00128 (or any other server I can find here). On top of that, if I reboot server sv00127 (which does not happen often, it's a crucial production server) it works from server sv00127 too ... for a couple of days.
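
Since a hung run ties up the terminal, each test attempt can be bounded with a small watchdog. This is only a sketch assuming a POSIX shell: run_with_timeout is a hypothetical helper name, not an HP-UX command, and the 120 second limit in the example is an arbitrary choice.

```shell
# Hypothetical helper: run a command, but kill it if it exceeds a time limit.
# Usage: run_with_timeout SECONDS COMMAND [ARGS...]
run_with_timeout() {
    limit=$1
    shift
    "$@" &                        # start the command in the background
    cmd_pid=$!
    # Watchdog: after $limit seconds, terminate the command if still running.
    ( sleep "$limit"; kill "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    watchdog_pid=$!
    status=0
    wait "$cmd_pid" || status=$?  # exit status of the command (non-zero if killed)
    # The command is done, so stop the watchdog and reap it.
    kill "$watchdog_pid" 2>/dev/null || true
    wait "$watchdog_pid" 2>/dev/null || true
    return "$status"
}

# Example (on the affected host): give traceroute at most 120 seconds.
# run_with_timeout 120 traceroute xw218 || echo "traceroute failed or was killed"
```

A non-zero return only says that the command failed or was killed; it does not by itself distinguish a hang from a slow but finishing run.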

Has anyone noticed this strange behavior before? Is it patch related? (I could not find anything relevant, but I might have overlooked something.) It might be useful to know that sv00127 is the primary DNS server (but sv00128 is the secondary, so that shouldn't have any impact).

The network guys are getting annoyed with this (and then they turn around and annoy me :-).
Thanks in advance,
Tom
A life ? Cool ! Where can I download one of those from ?
34 REPLIES
U.SivaKumar_2
Honored Contributor

Re: traceroute hangs from one specific server

Hi,
The problem is with DNS. If traceroute cannot do the reverse lookup, it hangs. To identify the problem, run:

#traceroute -n xw218

It will not hang now.

regards,
U.SivaKumar
Innovations are made when conventions are broken
U.SivaKumar_2
Honored Contributor

Re: traceroute hangs from one specific server

Hi,
To solve the DNS problem, compare the /etc/resolv.conf and /etc/nsswitch.conf files on both servers. Copy the files from the good server to the problematic one.
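
A rough way to do that comparison while ignoring comments and blank lines is sketched below. The function names and the example file names are hypothetical, and it assumes that "#" and ";" only ever start comments in these files.

```shell
# Print a config file with comments, trailing blanks and empty lines removed.
# Assumption: "#" and ";" are only used to introduce comments.
normalize_conf() {
    sed -e 's/[#;].*//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}

# Succeed (exit 0) when two config files have the same effective content.
same_conf() {
    [ "$(normalize_conf "$1")" = "$(normalize_conf "$2")" ]
}

# Example, using copies fetched from each server (hypothetical file names):
# same_conf resolv.conf.sv00127 resolv.conf.sv00128 || echo "resolv.conf differs"
```

If the files do differ, a plain diff of the two normalized outputs shows where.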

regards,
U.SivaKumar
Innovations are made when conventions are broken
Tom Geudens
Honored Contributor

Re: traceroute hangs from one specific server

I'm afraid that doesn't do the trick :
From server sv00127
#traceroute -n xw218
traceroute to xw218.dolmen.be (10.118.112.18), 30 hops max, 20 byte packets
1 10.101.1.25 1 ms 1 ms 1 ms
2 10.102.2.1 7 ms 1 ms 1 ms
3 192.168.30.1 2 ms 2 ms 2 ms
4 10.118.112.18 3 ms
... and hangs again

Regards,
Tom
A life ? Cool ! Where can I download one of those from ?
Anil C. Sedha
Trusted Contributor

Re: traceroute hangs from one specific server

Tom,

Try to look for this patch

PHNE_23274

Check if it is worth installing on your system. If it is, then go ahead, as this patch resolves some nslookup issues and would help with name resolution. Traceroute can hang because of that.

Regards,
Anil
If you need to learn, now is the best opportunity
Ron Kinner
Honored Contributor

Re: traceroute hangs from one specific server

Tom,

The fact that it does not return a star indicates that it's probably a software problem so I'd go with the patch idea.

However, just for grins, what does the traceroute in the other direction show? From xw218 back to sv00127 and sv00128? Does ping work? Is xw218 a Cisco device? Does traceroute -dv xw218 give you any extra info?

Ron

PS Anybody know why there is no man page for traceroute on hpux?
Tom Geudens
Honored Contributor

Re: traceroute hangs from one specific server

Anil,
I'm going to try PHNE_23274. However, since this is a very important production server, this is going to take a while (it first has to go through test / acceptance). On top of that, sv00128 doesn't have that patch either ... so allow me to be a bit sceptical (even though the patch has three stars and sounds like the thing).
If it solves the problem ... I'll give you full marks !

Ron,
You definitely got a grin ... the problem is that xw218 is down most of the time (reason that the traceroute doesn't complete ;-) ... so I can't test your suggestions.
I've wondered about the missing manpage as well, so any answers on that will also get points (although I'll keep 'm back until the real issue is solved).

Regards,
Tom
A life ? Cool ! Where can I download one of those from ?
Anil C. Sedha
Trusted Contributor

Re: traceroute hangs from one specific server

Hope it helps, Tom.

After all, we are all here to help. If it doesn't, open a call with HP; that should help you in all probability.

Regards,
Anil
If you need to learn, now is the best opportunity
rick jones
Honored Contributor

Re: traceroute hangs from one specific server

you might consider taking a tusc system call trace of traceroute on the system where it hangs. get the latest rev of tusc (7.4) from ftp.cup.hp.com under dist/networking/tools/ and use the verbose option.

then you can simply look at the last system call(s) it makes and that may yield a clue as to where it is getting hung-up.
there is no rest for the wicked yet the virtuous have no pillows
Tom Geudens
Honored Contributor

Re: traceroute hangs from one specific server

Hi Rick,
I've attached the output from tusc (which I attached to the process at the moment traceroute hung) ... it's Chinese to me. Can you/anyone else interpret the output ?

Regards,
Tom
A life ? Cool ! Where can I download one of those from ?
Tom Geudens
Honored Contributor

Re: traceroute hangs from one specific server

Alas, alas, the patch (PHNE_23274) does not solve the problem. Back to the drawing board. Does anyone have any ideas left (I just hate opening calls ... :-) ?

Regards,
Tom
A life ? Cool ! Where can I download one of those from ?
Paula J Frazer-Campbell
Honored Contributor

Re: traceroute hangs from one specific server

Tom

From your tusc output, it is not stuck but still working.


Have a look at the ip address in fields "sin_addr.s_addr" and see if they can point you to where the problem is.

Have you compared your routing tables ?


Paula
If you can spell SysAdmin then you is one - anon
Ron Kinner
Honored Contributor

Re: traceroute hangs from one specific server

try
traceroute -w 2 xw218

see if changing the wait time to the minimum helps any.

If you trace to an unused address on the same LAN as xw218 do you get the same hang?

Compare the output of
ping -o xw218
(after you stop it)
from both boxes.

Ron
rick jones
Honored Contributor

Re: traceroute hangs from one specific server

I forgot to mention that tusc by default only shows syscall exit, so if the process enters a syscall (say select) that never returns, the trace will not show that call, which could be the problem here.

iirc, the -E option to tusc will show both entry and exit. You might also add a '-T ""' to the tusc command lines.

tusc does not seem to break-out the timeval struct in the select call, so I cannot see what is being passed-in for the timeout.

there is one oddity in the trace however - there are no sendto() or write() calls for each packet that is supposed to be triggering the ICMP's from the remote, and also some set/getsockopts related to setting the TTL in the ip header and such.
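
One way to check for missing send calls is to tally the syscall names in the trace. The sketch below assumes that each call is logged as "name(args...) = result", possibly after a timestamp or pid column; syscall_counts and trace.txt are hypothetical names.

```shell
# Count calls per syscall name in a trace read from stdin.
# Assumes each call is logged as "name(args...) = result", optionally
# preceded by a timestamp or pid column; lines without "(" are ignored.
syscall_counts() {
    awk '{
        p = index($0, "(")
        if (p > 1) {
            # The syscall name is the last word before the first "(".
            n = split(substr($0, 1, p - 1), w, " ")
            if (n > 0) count[w[n]]++
        }
    } END {
        for (name in count) print count[name], name
    }' | sort -rn
}

# Example: syscall_counts < trace.txt
```

If sendto or write never shows up in the tally while select and recvfrom do, that matches the oddity described above.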
there is no rest for the wicked yet the virtuous have no pillows
Tom Geudens
Honored Contributor

Re: traceroute hangs from one specific server

All,
The problem seems to be that traceroute does not work (or loops, or searches the whole network, or ...) when a certain device is down.

Paula,
Routing is the same for both servers ...
The IP's in the output seem to be all the routers we've got over here (and we've got a couple ;-). It's not quite clear - for me - why all those IP's are in there.

Ron,
xw218 is down, ping -o doesn't work ...

Rick,
root/sv00127#/opt/tusc/bin/tusc -T "" -E traceroute xw218 > /var/adm/crash/tusc_xw218.txt
# Output in attachment (gzipped)

Regards,
Tom
A life ? Cool ! Where can I download one of those from ?
Tom Geudens
Honored Contributor

Re: traceroute hangs from one specific server

Mmm the attachment seems not to have worked ...
Again
A life ? Cool ! Where can I download one of those from ?
Tom Geudens
Honored Contributor

Re: traceroute hangs from one specific server

Hi again,
New development. It would seem that Paula is on the right track, traceroute on sv00127 does not hang but takes a lot longer and the duration time seems to fluctuate. Right now it takes longer ... but does end in a reasonable time.

Attached (the first one in this reply, the second in the next) are the complete tusc traces for both sv00127 and sv00128. Does anyone "see" what's wrong with the first one ?

Regards,
Tom
A life ? Cool ! Where can I download one of those from ?
Tom Geudens
Honored Contributor

Re: traceroute hangs from one specific server

And here's the second (correct one from sv00128) one.
Regards,
Tom
A life ? Cool ! Where can I download one of those from ?
Ravi_8
Honored Contributor

Re: traceroute hangs from one specific server

Hi, Tom

It seems that your network card is running half duplex.

Make it full duplex:
sam --> Networking and Communications --> Network Interface Cards --> choose your LAN card --> Actions --> Modify

You may have to turn off auto-negotiation.
never give up
Tom Geudens
Honored Contributor

Re: traceroute hangs from one specific server

Hi,
root/sv00127#lanadmin -x 0
Current Speed = 100 Full-Duplex Auto-Negotiation-OFF

root/sv00128#lanadmin -x 0
Current Speed = 100 Full-Duplex Auto-Negotiation-OFF

Seems to me like it's Full Duplex though ... or am I missing something ?

Regards,
Tom
A life ? Cool ! Where can I download one of those from ?
Paula J Frazer-Campbell
Honored Contributor

Re: traceroute hangs from one specific server

Tom

Can you diff the two outputs from tusc and post the results.




Paula
If you can spell SysAdmin then you is one - anon
Paula J Frazer-Campbell
Honored Contributor

Re: traceroute hangs from one specific server

Tom

Forget the diff - I have done it:-

Things to look at :-

/etc/nsswitch.conf

Also, st_mtime shows differences between the servers. Are they patched the same?

root@d370/>diff 332.txt 322.txt | grep st_mt | grep 2000
< st_mtime: Fri Jan 7 01:09:53 2000
< st_mtime: Sat Jul 8 00:55:00 2000
< st_mtime: Fri Jan 7 01:09:53 2000
< st_mtime: Sat Jul 8 00:55:00 2000
< st_mtime: Sat Jul 8 00:55:00 2000
< st_mtime: Sat Jul 8 00:55:00 2000
< st_mtime: Fri Jan 7 01:09:53 2000
< st_mtime: Fri Jan 7 01:09:53 2000
< st_mtime: Sat Jul 8 00:55:00 2000
root@d370/>diff 332.txt 322.txt | grep st_mt | grep 2001
> st_mtime: Tue Nov 27 09:25:23 2001
> st_mtime: Tue Nov 27 09:25:23 2001
> st_mtime: Tue Nov 27 09:25:23 2001
> st_mtime: Tue Nov 27 09:25:23 2001
root@d370/>diff 332.txt 322.txt | grep st_mt | grep 2002
> st_mtime: Fri Apr 12 10:30:00 2002
> st_mtime: Fri Apr 12 10:30:00 2002
> st_mtime: Fri Apr 12 10:30:00 2002
> st_mtime: Fri Apr 12 10:30:00 2002
> st_mtime: Fri Apr 12 10:30:00 2002



Search for files with the above timestamps; they may point to a patch.
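
The per-year greps above can be collapsed into one summary. A sketch, assuming the year is the last field of each st_mtime line as in the output shown; mtime_diff_summary is a hypothetical name.

```shell
# Summarize the st_mtime lines that differ between two trace files.
# Prints "<side> <year> <count>", where side is "<" (first file only)
# or ">" (second file only), exactly as diff marks them.
# Assumes the year is the last field of each st_mtime line.
mtime_diff_summary() {
    diff "$1" "$2" | awk '/st_mtime/ { print $1, $NF }' | sort | uniq -c |
        awk '{ print $2, $3, $1 }'
}

# Example with the two trace files compared above:
# mtime_diff_summary 332.txt 322.txt
```

Years that appear on one side only are the ones worth matching against patch install dates.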

Paula

If you can spell SysAdmin then you is one - anon
rick jones
Honored Contributor

Re: traceroute hangs from one specific server

I may be wrong, and may have become confused among all the attachments, but I don't think the two machines are running the same traceroute utility. One syscall trace still shows just those select/recvfrom calls and no sends to trigger the ICMPs, while the other shows send/poll/etc. calls.
there is no rest for the wicked yet the virtuous have no pillows
Tom Geudens
Honored Contributor

Re: traceroute hangs from one specific server

Hi,
An update after the long weekend. I've come to believe that the problem is ARP-cache corruption. I've noticed that the traceroute takes forever when it looks like this :
root/sv00127#arp -a
10.101.1.100 (10.101.1.100) at 0:0:c:7:ac:65 ether
10.115.16.87 (10.115.16.87) -- no entry

Where 10.115.16.87 is an IP that according to me (and the networkadmins) can not possibly be in the ARP-cache (even though it seems to be there).

Whenever it looks normal, like this :
root/sv00127#arp -a
10.101.1.100 (10.101.1.100) at 0:0:c:7:ac:65 ether
sv00226.dolmen.be (10.101.5.2) at 0:10:83:f5:45:54 ether
sv00224.dolmen.be (10.101.3.5) at 0:2:a5:8c:11:e6 ether
sv00229.dolmen.be (10.101.5.3) at 0:50:8b:a1:67:40 ether
sv00248.dolmen.be (10.101.2.3) at 0:10:83:fc:b2:53 ether
traceroute works fine.

And now we come to the one difference between this system and the other systems, namely that patch PHNE_23456 (or above) can NOT be installed on this system. The reason for this is that it is the Control-M killer patch (if you install it, older versions of Control-M will no longer work). Guess what, PHNE_23456 is the ARPA cumulative patch :-)

At the moment the arp-cache is fine (and I do not know how to cause the corruption). I'd still like some verification though. Does this sound plausible (and if yes, can anyone verify it) ?
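
To catch the cache in the bad state when it happens again, the "-- no entry" lines can be picked out of the arp -a output and logged. This is a sketch only: unresolved_arp_entries and the log path are hypothetical, and it assumes the output format shown above.

```shell
# Print the IP address of every unresolved entry in `arp -a` style output
# read from stdin, i.e. lines of the form "name (ip) -- no entry".
unresolved_arp_entries() {
    awk '/-- no entry/ { gsub(/[()]/, "", $2); print $2 }'
}

# Example (on the affected host), e.g. run from cron every few minutes:
# bad=$(arp -a | unresolved_arp_entries)
# [ -n "$bad" ] && echo "$(date) unresolved ARP entries: $bad" >> /var/tmp/arp_watch.log
```

Correlating the logged timestamps with the slow traceroute runs would help confirm or rule out the ARP theory.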

Regards,
Tom
A life ? Cool ! Where can I download one of those from ?
Ron Kinner
Honored Contributor

Re: traceroute hangs from one specific server

If you think you might have a bad entry in the ARP cache then the command
arp -d hostname
can be tried to remove the bad entry.

What I suspect is corrupting your ARP table is that there is sometimes a problem pinging your default router. On 11.0 this causes the route to be declared bad, and it is removed from the route table.

If this happens your hp may try to ARP for the ethernet address and hope to receive a proxy arp reply which apparently doesn't happen. This puts an unresolved entry into your ARP cache. This should, I would expect, be removed at least after 5 minutes (from ndd -h):

arp_cleanup_interval:

The amount of time that non-permanent, resolved entries are permitted to remain in ARP's cache. [30000, 3600000] Default: 300000 (5 minutes).

Or is that something that was broken before the patch you mentioned?

There is a parameter in ndd called ip_ire_gw_probe which, if you set it to 0, will stop testing the gateway. This may stop the proxy ARP business and keep your ARP table clean, but I still wonder why the router is not responding. Is it perhaps at times overloaded with input traffic?

I had a router with a 10 Half Duplex interface on a LAN where everyone else was 100 Full Duplex. One process fired every 30 minutes and so overloaded the input to the router that nobody else on the LAN could get through to the router for anything else. Perhaps there is a similar process on the good box which is doing the same thing, but because the process originates on the same box as the good traceroute, the local TCP/IP process forces them to share.
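
For reference, the two ndd settings mentioned above can be read back as sketched below. The device names (/dev/arp and /dev/ip) are as I understand them for 11.x, so please confirm them with ndd -h on your own system; show_arp_tunables and the NDD variable are hypothetical names.

```shell
# Command used to query tunables; overridable so the function can be dry-run.
NDD=${NDD:-ndd}

# Print the current values of the two tunables discussed above.
# Assumption: arp_cleanup_interval lives under /dev/arp and
# ip_ire_gw_probe under /dev/ip.
show_arp_tunables() {
    for pair in "/dev/arp arp_cleanup_interval" "/dev/ip ip_ire_gw_probe"; do
        set -- $pair                  # $1 = device, $2 = parameter name
        printf '%s %s = %s\n' "$1" "$2" "$("$NDD" -get "$1" "$2")"
    done
}

# On the HP-UX host:
#   show_arp_tunables
# To stop gateway probing until the next reboot (try it on a test box first):
#   ndd -set /dev/ip ip_ire_gw_probe 0
```

The -set line is left as a comment on purpose, since changing it on a production server deserves a deliberate decision.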

Ron

PS I still think it's a patch issue since a well written traceroute should just * out gracefully if it didn't get a reply.