Re: Telnet Connection Latency

Ralph Grothe · ‎11-06-2003

Hello,

I have actually had this problem lingering for quite a while.
Every now and then, though, the latency to establish a new connection becomes unbearable for clients, which makes them complain,
especially with regard to telnet connections.

The box is an N4000 with 8 CPUs and 10 GB RAM, runs HP-UX 11.00, and operates as an MC/SG cluster node serving as an application server for another cluster node, the Oracle db server.

Socket-wise the machine is quite burdened, as it has about 1800 established TCP sockets on average, most of which are connected to db-related application servers.
Only a very small minority of those are SSH, FTP, and Telnet sessions.
Also, the average number of pseudo terminals in use is relatively low (well below the 60 defined by the kernel tunables npty and nstrpty).

I simply suspect that the sheer number of sockets, which all claim receive and send buffers from system memory and probably other resources for maintaining them (e.g. flushing, refilling), is too high, and that this impairs the establishing of new sockets.

On the other hand, network-related system metrics such as packet rates seem to lie in acceptable regions, i.e. no network bottleneck alerts were issued and net utilization is <= 50%.

While looking for possible tunable candidates in the TCP stack I came across this one:

# ndd -get /dev/tcp tcp_conn_request_max
20
# ndd -h tcp_conn_request_max

tcp_conn_request_max:

Maximum number of outstanding inbound connection requests.
[1, - ] Default: 20 connections

I'm not quite sure about the impact of this tunable but I would consider a value of 20 a bit low considering the amount of sockets this box has to care for.

What do you think?

Rgds.
Ralph

Madness, thy name is system administration

Ron Kinner · ‎11-07-2003

Don't think that will help much. It just controls how many requests you can have waiting to be connected. If it were too small the people would complain about sometimes not being able to connect.

I believe telnet is one of those processes which like to see who they are talking to so you might see if you are getting timeouts with
nslookup
ClientName
then when you get a result try
ClientsIPaddress
and see what happens. exit to get out of nslookup.

ClientName is the name of one of the machines trying to telnet in. ClientsIPaddress is the IP address returned by the nslookup.
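If you would rather time the lookup than eyeball it in nslookup, here is a minimal Python sketch, assuming Python is available on the box running telnetd. The function name and the injectable `resolver` argument are my own illustration, not anything from HP-UX; it simply times the standard library's `socket.gethostbyaddr`, which goes through whatever name services the system is configured to use.

```python
import socket
import time


def timed_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (hostname or None, seconds elapsed) for a reverse lookup of ip.

    `resolver` defaults to the system resolver; it is a parameter only so
    the function can be exercised without touching the network.
    """
    start = time.monotonic()
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        name = resolver(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses
        name = None
    return name, time.monotonic() - start
```

Calling `timed_reverse_lookup("<ClientsIPaddress>")` on the server and seeing `None` after many seconds would point at a reverse lookup timeout rather than at the TCP stack.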

You might also verify that there are no network problems such as a mismatched duplex or WAN bottleneck by doing some pinging.

netstat -s

might be of some use too. Also
lanadmin
(think you choose Lan then Display but I don't have one to check. The second page of the Display shows errors. If you have multiple NICs then you will need to use the ppa x command to see NIC x)

Ron

rick jones · ‎11-07-2003

increasing tcp_conn_request_max is indicated when you see the "connection dropped due to full queue" statistic of netstat -p tcp incrementing:

ftp://ftp.cup.hp.com/dist/networking/briefs/

it depends not on the number of established connections, but on the rate of connection establishment
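To check whether that statistic is actually incrementing, you can compare two snapshots of `netstat -p tcp` taken some time apart. A rough Python sketch follows; the function names are made up for illustration, and the regular expression assumes the counter line reads "connect requests dropped due to full queue" as in HP-UX netstat output.

```python
import re

# Matches e.g. "    217 connect requests dropped due to full queue"
_FULL_QUEUE = re.compile(
    r"^\s*(\d+)\s+connect requests dropped due to full queue", re.MULTILINE
)


def full_queue_drops(netstat_text):
    """Extract the full-queue drop counter from `netstat -p tcp` output text."""
    match = _FULL_QUEUE.search(netstat_text)
    if match is None:
        raise ValueError("full-queue counter line not found")
    return int(match.group(1))


def drops_since(earlier_text, later_text):
    """Drops between two snapshots; a positive value means the queue overflowed."""
    return full_queue_drops(later_text) - full_queue_drops(earlier_text)
```

A difference that stays at zero across the slow periods suggests the listen queue is not the bottleneck.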

there is no rest for the wicked yet the virtuous have no pillows

Ralph Grothe · ‎11-18-2003

Hello again,

I'm terribly sorry, I have almost forgotten this thread although the problem is still latent.

Rick,

the way you explained the mentioned ndd tunable to me in fact was the way I had understood it, i.e. applying only to the process of establishing new sockets that simultaneously call connect().
Thus I agree that 20 is sufficient.
This is also in accordance with the low total of dropped connect requests due to full queue.

Here are the netstats

# netstat -p tcp
tcp:
1960302420 packets sent
1900413814 data packets (3364512051 bytes)
141263 data packets (72401168 bytes) retransmitted
59891421 ack-only packets (27754422 delayed)
94 URG only packets
872 window probe packets
4125 window update packets
2028557 control packets
2005867990 packets received
1869866279 acks (for 3364054490 bytes)
1136959 duplicate acks
247 acks for unsent data
1938570229 packets (924916297 bytes) received in-sequence
4 completely duplicate packets (108 bytes)
274985 packets with some dup, data (35119813 bytes duped)
300024 out of order packets (8508502 bytes)
105 packets (3222177244 bytes) of data after window
425 window probes
134105331 window update packets
403 packets received after close
560 segments discarded for bad checksum
0 bad TCP segments dropped due to state change
692168 connection requests
351613 connection accepts
1043781 connections established (including accepts)
1152735 connections closed (including 110387 drops)
92045 embryonic connections dropped
1868834879 segments updated rtt (of 1868834879 attempts)
122449 retransmit timeouts
2559 connections dropped by rexmit timeout
872 persist timeouts
13746 keepalive timeouts
12728 keepalive probes sent
45 connections dropped by keepalive
217 connect requests dropped due to full queue
13455 connect requests dropped due to no listener
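To put the full-queue figure in proportion, here is the arithmetic on the two counters from the output above, written as a small Python snippet. Relating the drops to the number of accepts is just one way of reading the numbers, not an official metric.

```python
# Counters copied from the netstat -p tcp output above.
connection_accepts = 351613
queue_full_drops = 217

# Full-queue drops per accepted inbound connection since the counters were reset.
drop_ratio = queue_full_drops / connection_accepts
print(f"{drop_ratio:.3%}")  # 0.062%
```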

But something else must be the limiting factor.
I mean, each socket requires at least memory for its receive and send buffers.
But when the connection latencies occurred, both memory and CPU usage were OK.

Is there some other configurable system global that limits the number of sockets, other than the number of file descriptors or open files?

Madness, thy name is system administration

Sanjay_6 · ‎11-18-2003

Hi Ralph,

I hope this has nothing to do with the reverse lookup that the system does for the ip address from which the telnet / ftp session is coming.

Also check and see if you need a telnetd / ARPA networking patch that might help you with this.

Hope this helps.

Regds

Cara Tock · ‎11-18-2003

I would look at dns and reverse lookup as suggested above. I had a problem with telnet connection latency a while back and it was due to this. As soon as I corrected the dns entries the problem went away.

rick jones · ‎11-18-2003

It is still OK to set tcp_conn_request_max to 1024. It won't hurt anything.

The suggestion to check DNS is a good one - that is the most common reason for taking a long time to get to a login prompt with telnet and such.

On the system, examine the contents of your /etc/nsswitch.conf file. Then make sure that you can still ping your DNS servers - try them in order from the /etc/resolv.conf file. If one or more of them are not responding, fix it :)
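For the "try them in order" part, a small Python sketch that pulls the nameserver entries out of resolv.conf text in file order. The function name is hypothetical; it only assumes the usual `nameserver <address>` line format and `#` or `;` comment lines.

```python
def nameservers(resolv_conf_text):
    """Return nameserver addresses in the order they appear in resolv.conf text."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comment lines
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers
```

You would feed it the contents of /etc/resolv.conf and then ping or query each returned address in turn.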

As for sockets, yes, a socket will consume some memory, but it will _NOT_ preallocate space for sends and recvs. Data for sends is allocated on an as-needed basis, and the memory for data queued for recv is allocated by the NIC driver before the packet arrives.

there is no rest for the wicked yet the virtuous have no pillows

Ralph Grothe · ‎11-18-2003

You are all right that most of the time when a client has to wait 2 minutes for the login prompt, it is caused by some DNS reverse lookup problem (e.g. DNS server not reachable and no entries in /etc/hosts or another name service like NIS or LDAP).
I don't think it is a DNS lookup problem, but I will extend my script to also do DNS name resolutions. The script runs telnet sessions through CPAN's Net::Telnet, measures the time between open and close with Time::HiRes, and records the results to a logfile.

Btw, this is the nsswitch.conf on the login server, and we don't use other name services than DNS.
Our domain's primary and secondary DNS servers are listed as such in the /etc/resolv.conf

# grep -v ^# /etc/nsswitch.conf
passwd: files
group: files
hosts: files [NOTFOUND=continue] dns
services: files

/etc/hosts contains only the IPs, FQDNs, and node names of the hosts that are cluster nodes, as well as those of the cluster's packages.
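To make the lookup order in that hosts line explicit, here is a short Python sketch that reads the sources out of nsswitch.conf text and drops the bracketed action criteria. The function name is made up for illustration, and it assumes the plain `hosts: source [criteria] source` layout shown above.

```python
import re


def hosts_lookup_order(nsswitch_text):
    """Return the name service sources for `hosts:` in the order they are tried."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if line.startswith("hosts:"):
            # Remove action criteria such as [NOTFOUND=continue]
            sources = re.sub(r"\[[^\]]*\]", " ", line[len("hosts:"):])
            return sources.split()
    return []
```

For the file above this gives files first, then dns, so any client missing from /etc/hosts ends up at the DNS servers for its reverse lookup.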

Madness, thy name is system administration

Bill Hassell · ‎11-19-2003

If you have 3 entries in /etc/resolv.conf *and* the delay occurs only at login *and* it is usually 90-120 seconds long, these are DNS server failures. Your DNS servers have simply disappeared and that is a critical problem. Reverse lookup is mandatory for security and as you mentioned, the delay goes away when the user's computer is listed in /etc/hosts (and likely, /etc/nsswitch.conf says: use files first, then DNS). There is no solution other than to improve the reliability of the DNS servers. If you have 3 of them, it is unlikely that all of them crashed at the same time, so you may have network routing problems, or possibly a hacker is running a denial of service attack against the servers. Both are serious problems that need to be addressed.

Bill Hassell, sysadmin

Ralph Grothe · ‎11-19-2003

Just to make sure it isn't a DNS problem.

I found that the IP address of a telnet client that is reporting the login problems cannot be reverse-resolved by an nslookup issued on the HP-UX server that runs the telnetd.
But since the IP of this client should be in a domain that we delegated, our DNS servers cannot resolve this name, and I'm convinced that the IP<->name resolution should be taken care of by the client's authoritative DNS servers, viz. their own.

Could this be the cause for the login retardation?

Madness, thy name is system administration

Ralph Grothe · ‎11-19-2003

In further search for hints that would lead me to a solution I had a read in the
"HP-UX Internet Services Administrator's Guide".
There I found the following paragraph concerning the optional security wrapper file /var/adm/inetd.sec for the inetd that spawns new telnetds on incoming requests.
What irritates me is that they mention an upper limit of 1000 connections if not otherwise specified.
But since no inetd.sec file exists at all, I'm not sure whether this applies.
Here is the cited paragraph from the mentioned guide:

Maximum number of connections? The maximum
number of simultaneous connections is specified in the
optional file /var/adm/inetd.sec. When inetd is
configured, it checks this file to determine the number
of allowable incoming connections. Look at this file to
determine how many connections are allowed. The
default is 1000.

Madness, thy name is system administration

Massimo Bianchi · ‎11-19-2003

Try this: create, in each user home directory, a file called

.nslookuprc

and put therein these lines:
timeout=2
retry=2

These are used to set proper timeouts, and they are used not only by nslookup but also by other programs...
This way it will try 2 times to connect and wait up to 2 seconds before giving up.

HTH,
Massimo

Bill Hassell · ‎11-20-2003

I don't remember seeing any parameter in inetd.sec that defines the number of connections and the man page for inetd.sec does not mention this either.

As far as reverse lookup goes, your DNS servers are failing to resolve reverse lookups by IP. You can prove that by querying each DNS server in resolv.conf with an IP followed by the name or IP of the desired server:

nslookup 12.34.56.78 dns_server1
nslookup 12.34.56.78 dns_server2
nslookup 12.34.56.78 dns_server3

Note that it is entirely too common to find reverse lookup missing in DNS servers, especially where the server is Windows-based.

Bill Hassell, sysadmin

Steven E. Protter · ‎11-20-2003

As Bill points out, the most common cause of telnet latency is DNS resolution delays.

It happened in my own private office. Until I got the Linux DNS server working properly (and opened up port 53 to the internal network in iptables), I had latency that I couldn't explain.

Also, I've seen latency with older Microsoft NT 4.0 Server DNS servers. For some reason they gave very slow answers at work until we upgraded them to W2K (shoulda used HP-UX).

SEP

Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com

Tony Horton · ‎11-20-2003

Hi Ralph,

You said:
--------------------------------------------
Just to make sure it isn't a DNS problem.

I found that the IP address of a telnet client that is reporting the login problems cannot be reverse-resolved by an nslookup issued on the HP-UX server that runs the telnetd.
But since the IP of this client should be in a domain that we delegated, our DNS servers cannot resolve this name, and I'm convinced that the IP<->name resolution should be taken care of by the client's authoritative DNS servers, viz. their own.

Could this be the cause for the login retardation?

--------------------------------------------

Absolutely! This is the problem that I think everyone above was alluding to when they mentioned DNS being the culprit. I had this problem myself (it also shows up when you do a who -u). If there is no reverse zone then the server will time out, which takes quite a while. Make sure that the DNS used by the server running telnetd (i.e. the one they are telnetting to) has reverse zones for all of the nets telnetting in, and your problems should (hopefully) go away.

It doesn't matter if the client can successfully do a reverse lookup, it's the server that matters.

Regards,

Tony.

No man is an isthmus

Ralph Grothe · ‎11-23-2003

Hello again,

thanks for staying with this thread and your comments/suggestions.

To collect further data I deployed a little Perl script I wrote (see attachment) that makes telnet connections to three cluster nodes (one of which is the server with the large latency at times) as well as their packages, and logs the time it took from open() till close().
I had it run over the weekend.
Because I knew that the problem usually occurs in the morning hours, I also had the inetd on the telnetd server toggled to extended session logging (aka debug mode) with the -b switch between 5 and 10 am through a cron job.
The point is that the server where my script runs (i.e. the telnet client) is in the same LAN as the server where the telnetd is started (i.e. the telnet server), and that the telnet client's host definitely resolves in DNS (forward and reverse), since it belongs to our domain and both our DNS servers are authoritative for it.
But even with these prerequisites my script still logged the dreaded latency.

When I filter for log records whose time between open and close took over a minute I get these

$ perl -ane 'printf"%-s%10s%8.2f%8.2f%8.2f%8.2f%10u\n",scalar localtime($F[0]),@F[1..6] if $F[5]>60' nohup.out
Sat Nov 22 08:29:21 2003 sapa 0.00 99.86 99.93 99.93 12115
Sat Nov 22 08:59:13 2003 saturn 0.00 98.36 98.45 98.45 13083
Sat Nov 22 09:05:12 2003 saturn 0.00 65.96 66.02 66.02 13348
Mon Nov 24 09:20:28 2003 saturn 0.00 149.18 149.26 149.26 14490

The last field above is the PID of the respective telnetd on the telnet server for this session.
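For anyone who finds the one-liner above hard to read, the same filter can be sketched in Python. This assumes each record in nohup.out is whitespace separated as epoch time, host, four timing values, and PID, which is what the Perl field indices suggest; the function name is my own.

```python
def slow_sessions(lines, threshold=60.0):
    """Yield (epoch, host, timings, pid) for records whose sixth field exceeds threshold.

    Mirrors the Perl test `$F[5] > 60`: field 0 is the epoch time,
    field 1 the host, fields 2-5 the timings, field 6 the telnetd PID.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        if float(fields[5]) > threshold:
            timings = [float(value) for value in fields[2:6]]
            yield float(fields[0]), fields[1], timings, int(fields[6])
```

Passing it an open file handle on nohup.out would give the same four records as above, with the epoch left unformatted.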

From the syslog on the telnet server whose inetd at the time was under extended session logging I grepped the beginning of the forking of the telnetd PID, and its reaping by inetd's signal handler later.
I excluded the rest because the information in between isn't too revealing.
As you can see from the log time stamps this only took place after the waiting of the telnet client.

# 'wait3 returned pid=14490' /var/adm/syslog/daemon.log <
Nov 24 09:22:54 saturn inetd[709]: fork returned = 14490
Nov 24 09:22:58 saturn inetd[709]: reapchild(): wait3 returned pid=14490
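Putting the two logs side by side for PID 14490, as a small worked calculation in Python. It assumes the client-side timestamp marks the start of the session and that the clocks of both hosts are reasonably in sync; neither is confirmed by the logs themselves.

```python
from datetime import datetime, timedelta

# Client log record for PID 14490 (timestamp assumed to be the session start).
client_open = datetime(2003, 11, 24, 9, 20, 28)
client_wait = timedelta(seconds=149.18)

# Server syslog: "inetd[709]: fork returned = 14490"
inetd_fork = datetime(2003, 11, 24, 9, 22, 54)

fork_delay = (inetd_fork - client_open).total_seconds()
print(fork_delay)                 # 146.0
print(client_open + client_wait)  # roughly when the client's wait ended
```

Under those assumptions inetd forked the telnetd only about 146 seconds after the client started connecting, so almost all of the wait had passed before telnetd even existed.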

Madness, thy name is system administration
