
Sudden server disconnects; network debugging strategies requested

 
Ralph Grothe
Honored Contributor


Hello networkers,

I'm fully aware that it is almost impossible to expect some visitor to this forum who doesn't know our network infrastructure, network components, application interfaces etc. to aid in tracing the reason for the problems I'm faced with.
Nevertheless, maybe you can give me some general strategies or recipes to follow.

The symptoms are that clients who connect to one of our clustered DBMS are seemingly arbitrarily disconnected/kicked out.
The database affected (which is an Oracle instance) runs as a cluster (MC/SG) package which binds to the NIC lan2:6.
When I run lanadmin on ppa lan2 I can see no inbound or outbound errors, drops, collisions, or other packet discards.
The network guy on the client side says that his network components are working well, but that when he sends ICMP packets to the server (i.e. the IP of the package that disconnects) he receives a "source quench", which he says is proof to him that the server is definitely the cause.
Not knowing the network lingo I looked up in an internet dictionary what commonly is understood by "source quench".
There I read that it simply is a request from the receiving side to the sender to send the packets at a lower pace (which to me implies too heavy a load on the receiver). It also said that routers are not obliged to act on "source quench" requests.

Hm, I'm not able to discover any trouble with the NIC.
Apart from the lanadmin queries, which revealed no malfunction to me, a mere
"netstat -I lan2:6 -in"
reports 332008089 inbound and 326117719 outbound packets (since bringing the NIC up?).
This is an outbound to inbound ratio of some 98%.
I'm not sure if this ratio is meaningful at all.
I only realized that unfortunately the HP-UX netstat has no extra columns to account for errors and collisions like the versions from Solaris or Linux do.

I rather suspect the servers from the application side that are spawned through inetd to be the culprit.
Unfortunately I have no access means (logwise, debugging mode etc.) to the application to seek evidence because I've never been provided with details about the working of these servers by the customer who introduced this application.

All I can see are the establishments of connections in the syslog.log because inetd was started with "-l" flag.

Because the ports for these services were registered in /etc/services I know them and can grep for them on a casual
"netstat -anf inet",
which at the moment gives me some 45 established sockets.
But how can I find out when and why a disconnection occurs?
Unfortunately inetd only logs new connections in syslog.log but not when a connection suddenly severs.

To get a better overview I installed some freeware network tools on the box (e.g. lsof, libpcap, nmap, ntop, tcpdump).
Unfortunately I have little experience in using these tools efficiently.
Can someone give me some hints on how to locate the source of the sudden disconnects?

Many thanks for your patience

Ralph
Madness, thy name is system administration
Stefan Farrelly
Honored Contributor

Re: Sudden server disconnects; network debugging strategies requested

Ralph,

The source quench problem can be easily sorted out;

From the command line:
ndd -set /dev/ip ip_send_source_quench 0
OR
Put:
TRANSPORT_NAME[0]=ip
NDD_NAME[0]=ip_send_source_quench
NDD_VALUE[0]=0
in your /etc/rc.config.d/nddconf to set it on startup.
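
One caveat on the [0] index above: as far as I recall, the stanzas in nddconf are expected to be numbered consecutively from 0, so [0] only fits if the file has no active entries yet. A rough sketch for finding the next free index follows. The helper name next_ndd_index is invented here for illustration, and the consecutive-numbering rule is an assumption worth checking against the comments at the top of your own nddconf.

```shell
# Hypothetical helper (not part of HP-UX): print the next unused index
# for an nddconf-style file. Reads the file content on stdin, so it can
# be tried on a copy first. Commented-out lines are ignored.
next_ndd_index() {
    awk '
        /^[ \t]*#/ { next }                        # skip comment lines
        /^[ \t]*NDD_NAME\[[0-9]+\]=/ {
            s = $0
            sub(/^[ \t]*NDD_NAME\[/, "", s)        # strip up to the index
            sub(/\].*$/, "", s)                    # strip after the index
            if (s + 0 > max) max = s + 0
            found = 1
        }
        END { print (found ? max + 1 : 0) }        # 0 if no entries exist
    '
}

# Usage on the real box (path taken from the post above):
#   next_ndd_index < /etc/rc.config.d/nddconf
# To check the running value after the ndd -set:
#   ndd -get /dev/ip ip_send_source_quench
```

If the helper prints 0, the stanza can be used exactly as written above; otherwise replace [0] with the printed number in all three lines.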

Here's some more info on source quench:

Document ID : DCE19981119001



Problem Description

Are these Source Quench Messages something that I need to worry about?

Solution

This problem has been identified and is addressed in SR 5003435396. This
problem will be fixed in the 11.01 version of the HP-UX operating
system. These messages can be safely ignored as they have absolutely no
impact on the operating system (performance or otherwise). Alternatively
these messages can be prevented by disabling source quench. For more
information see the sections below.

What is causing these messages?

At 11.0 the Streams Xport layer now passes the ICMP echo request to any
other process that has a socket open and bound to raw IP. The rpcd/dced
daemon opens a raw socket to listen to ICMP messages. This raw socket is
opened by the icmp_monitor routine of rpcd. The main function of this
routine is to check for error messages from DCE servers registered in
the endpoint database of the host, and it checks the socket every 5
minutes. It does not respond to or use the ICMP echo requests. However,
the socket queue becomes filled during the 5 minute delay, causing the
source quench message. The fix being implemented in 11.01 will be to
increase the buffer size to 128k and shorten the wait interval from 5
minutes to 2 minutes, thereby flushing the queue of these unwanted
messages before the queue becomes filled.

Why is it safe to ignore these messages or to turn them off?

A good discussion of this is in TCP/IP Illustrated, Volume 1 (by W. Richard
Stevens), pages 160-162.

Here is a clip from page 161:

"Although RFC 1009 [Braden and Postel 1987] requires a router to
generate source quenches when it runs out of buffers, the new router
Requirements RFC [Almquist 1993] changes this and says that a router
must not originate source quench errors. The current feeling is to
deprecate the source quench error, since it consumes network bandwidth
and is an ineffective and unfair fix for congestion."

Also see RFC 1812 section 4.3.3.3 Source Quench (this is a good discussion).


As for other reasons for network disconnects check this out; we get this type of problem more than source quench problems.

http://searchnetworking.techtarget.com/tip/1,289483,sid7_gci802539,00.html

Im from Palmerston North, New Zealand, but somehow ended up in London...
Printaporn_1
Esteemed Contributor

Re: Sudden server disconnects; network debugging strategies requested

Hi Ralph,

Why not start with the nettl facility?
#netfmt -t 50 -f /var/adm/nettl.LOG00 > /tmp/nettl

Then check /tmp/nettl for error messages.
Any duplicate IP?

Also check syslog: are there any errors from ServiceGuard?
enjoy any little thing in my life
Steven Gillard_2
Honored Contributor

Re: Sudden server disconnects; network debugging strategies requested

This is always going to be a tough one to troubleshoot. This smells a bit like a server application problem to me as well - ie the server process that is spawned from inetd is exiting unexpectedly (maybe core dumping) and taking the connection down with it.

First I would look around for core files - if you find any use the 'file' command on them to work out if they're from your suspect process.
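
The core file hunt could be sketched roughly like this. The function name list_cores and the idea of starting from the application's directory are assumptions for illustration; the thread does not say where the servers run from.

```shell
# Hypothetical helper: list regular files named "core" below a directory.
# Defaults to / when no directory is given; errors from unreadable
# directories are discarded.
list_cores() {
    find "${1:-/}" -type f -name core -print 2>/dev/null
}

# Usage: feed each hit to 'file' to see which program dumped it, e.g.
#   list_cores /path/to/application | while read -r f; do file "$f"; done
```

Comparing the timestamps of any core files found (ls -l) with the times users report a disconnect would show whether the two line up.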

Secondly, grab a copy of 'tusc' and see if you can get a system call trace of one of these processes during a connection failure (easier said than done I know).

Regards,
Steve
Ron Kinner
Honored Contributor

Re: Sudden server disconnects; network debugging strategies requested

Fix the source quench problem first since it makes it hard to troubleshoot. Stefan is correct about source quench. We had the same problem.

Assuming your network guy has a Cisco router, have him run an extended ping with a sweep range of sizes. This will cause it to send a long series of pings from the minimum size up to the maximum that Cisco supports. If he gets random failures then you might want to look at your NIC. Ours turned out to be sensitive to electromagnetic interference.
If you pass the extended ping with sweep range of sizes then it's not a network problem.

When looking at the tcpdump output search for " R " (R with a space in front and in back) which indicates a reset was sent. I expect you will see one when a connection drops unexpectedly. If you don't see any I would expect that the application hung up properly and then investigate why the application decided to say bye bye.
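
A small filter along those lines, as a sketch only. The name filter_resets is made up, and the second pattern is an assumption added to cover newer tcpdump builds that print the flags as "Flags [R.]" instead of a bare letter.

```shell
# Hypothetical helper: keep only the lines of tcpdump text output that
# show a TCP reset.
#   older tcpdump:  "... > host.port: R 1234:1234(0) win 0"
#   newer tcpdump:  "... Flags [R.], seq 1, ack 2, ..."
filter_resets() {
    grep -E ' R |Flags \[[^]]*R[^]]*\]'
}

# Usage (interface and address are placeholders for your own values):
#   tcpdump -n -i lan2 host <package-ip> | filter_resets
```

Whichever side appears as the sender of the reset line is the side that tore the connection down. To keep the capture small, the same selection can be made at capture time with the filter expression 'tcp[13] & 4 != 0', which matches packets with the RST bit set.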

You might also look at netstat -a right after a drop and see what state the connection is in.
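
To make that check quick, the netstat output can be boiled down to one line per connection on the service port. This is a sketch under assumptions: the name conn_states is invented, and it expects the HP-UX style of "netstat -an" lines, with addresses written as a.b.c.d.port in columns 4 and 5 and the state in column 6.

```shell
# Hypothetical helper: print "remote-address state" for every TCP socket
# whose local port equals $1. Reads netstat -an style output on stdin.
conn_states() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && NF >= 6 {
            p = $4
            sub(/.*\./, "", p)          # keep what follows the last dot
            if (p == port) print $5, $6
        }
    '
}

# Usage (the port is whatever /etc/services lists for the service):
#   netstat -an | conn_states <port>
```

Saving this output once a minute with a timestamp and comparing successive snapshots would also show roughly when a client vanished, and a state such as CLOSE_WAIT or FIN_WAIT_2 left behind hints at which end closed first.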

Ron
sven verhaegen
Respected Contributor

Re: Sudden server disconnects; network debugging strategies requested

HI

I tend to agree with deactivating source quench first. I have very good reason to: I am well aware of several problems at HP customers with this feature of 11.x. Under normal circumstances it doesn't hinder the machine, but in some extreme cases (heavily loaded machines) I noticed that source quenching became disruptive, meaning it slowed communications down. What you could have here is perfectly normal network performance being throttled because of the ICMP messages, and the end machine giving up on the connection because it times out after several negative replies or dropped packets.

Take that step first. If the problem persists, we need to see where the connection gets broken, client or server side, and that will generally mean tracing. First check the syslog for any 'connection reset by peer' messages; they could still point to an end client issue. If none are visible, start tracing the problem with whatever tool is at your disposal, but try to limit the tracing as it can grow huge. I hope you can easily reproduce the problem and you don't need to transfer 300 MB of data before it occurs. I'd go for a PC with netmon or something like that actively capturing on the network until a user yells "disconnected", then stop the trace at that point, filter out only that traffic and look at the last packet sequence. You would then know who broke the communication and why, e.g. bad packets, no reply, timeout, retransmission failure; all are possible.
...knowing one ignores a greath many things is the first step to wisdom...
Ralph Grothe
Honored Contributor

Re: Sudden server disconnects; network debugging strategies requested

Hello to all responders,

many thanks for your valuable suggestions.

Unfortunately I couldn't give any feedback yesterday, because the ITRC webserver only gave me a chance to assign points, and afterwards didn't serve my request for the reply form any more.
I think most of you reside in the USA.
So I wonder if you have similar trouble with your ITRC access.
Here in Berlin, Germany, I continuously have trouble accessing ITRC after about 12:00 CET, although I'm coming in over the European httpd dispatcher.

Now back to our network problem.
In parallel I had a call yesterday with the HP Support centre in Ratingen, Germany.
This was after I had read the very informative reply from Stefan, where he suggested disabling the generation of source quenches through ndd.
I also mentioned your suggestion to the support engineer, but he wasn't convinced that disabling SQs altogether was a good idea.
He suggested instead that I install subsystem patch PHSS_21614, which is said to increase the buffer size to 128 KB and thus considerably reduce the churning out of SQs.
Then he also gave me some hints what to perform to test the stability of the network connection.
Since, as I wrote, these servers are started by inetd, he also told me that there was an undocumented switch "-b" for the inetd which sets it into debugging mode and lets it log more verbosely into syslog.log.
So when a client encounters a disconnect next time I will restart inetd in this mode.

Stefan, the URL you supplied is great,
and I carefully read what was written there about the causes of duplex and speed mismatches.
With this in mind I checked the NIC settings of the server against the port settings of the switch where the server is plugged in.
Both were set to autonegotiation, full duplex 100 Mbps.

Madness, thy name is system administration