Operating System - HP-UX
SOLVED
Ralph Grothe
Honored Contributor

Still that telnet latency

Hi,

I had posted in this matter several times in the past.
Due to a high workload fixing problems on other systems and deploying new systems on new platforms (e.g. Veritas Cluster Server on Solaris, which is a new cluster software, volume manager, and OS to me), I haven't been too persistent in fixing this one.
Above all I'm clueless.

Some of you as well as HP Support told me that there was nothing left to configure for TCP services like telnet, and that I probably had a routing problem.

To exclude such influences I modified my telnet checker script (it uses CPAN's Net::Telnet module) such that the daemon now runs solely on the affected box and does nothing but open telnet sessions to itself, i.e. to the IP addresses and DNS FQDNs of the host and of the virtual IPs of the MC/ServiceGuard packages running on this host, round robin at 180 sec intervals.
So now there should be no routing involved whatsoever.
The FQDNs are listed in /etc/hosts, and /etc/nsswitch.conf instructs processes that use the resolver library calls to query files before dns.
Besides the connection times, the script now also logs the number of parallel ESTABLISHED sockets to local port 23 (i.e. other telnetters).
This daemon has been self-connecting to the listening telnet socket for the last week.
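In essence the checker's main loop boils down to this (a stripped-down sketch; the real host list and logging are more elaborate, and 'pkg1-vip' is just a placeholder for the package names):

#!/usr/bin/perl
# stripped-down sketch of the telnet checker daemon
use strict;
use warnings;
use Net::Telnet;
use Time::HiRes qw(gettimeofday tv_interval);

# host IP/name plus the packages' virtual IPs/names (placeholders here)
my @targets = ('127.0.0.1', 'saturn', 'pkg1-vip');

while (1) {
    for my $host (@targets) {
        my $t0 = [gettimeofday];
        my $t  = Net::Telnet->new(Errmode => 'return', Timeout => 120);
        my $ok = $t->open(Host => $host, Port => 23);
        my $elapsed = tv_interval($t0);
        # count parallel ESTABLISHED sockets involving port 23
        my $est = grep { /\.23\s.*ESTABLISHED/ } qx(netstat -anfinet);
        printf("%s %-12s %6.2fs est=%d %s\n",
               scalar(localtime), $host, $elapsed, $est, $ok ? 'ok' : 'FAIL');
        $t->close if $ok;
        sleep 180;    # round robin in 180 sec intervals
    }
}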
And still the log clearly exhibits connection times well above 1 min at times (mainly around 09:00 AM).
I looked through all sorts of system metrics logged by PerfView but cannot find any evidence of the system buckling during the intervals when telnet showed latency.
In fact the peaks appear much later in the day.
The number of parallel (established) telnet sessions at those times was 5 at maximum.
So no lack of ptys either.
OK, I haven't yet logged the ARP table through my script.
I'm clueless where else to look.
I had also asked in that context how to monitor the TCP queues, but my version of the TCP driver seems broken when it comes to querying /dev/tcp for tcp_status.
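For the record, this is the query that fails here; on a healthy 11.x transport it should dump state and queue information for every TCP endpoint:

# ndd -get /dev/tcp tcp_status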

I would be very grateful for suggestions as to what else I could monitor to trace this problem.

Rgds.
Ralph
Madness, thy name is system administration
16 REPLIES
Sridhar Bhaskarla
Honored Contributor

Re: Still that telnet latency

Hi Ralph,

Could you reproduce the problem manually by doing a 'telnet localhost' at the time?

Did you try to run 'tusc' on the inetd process (with the follow-forks option) around 9 AM to see where exactly the delay is occurring?
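Something like this should do, assuming your tusc build supports attaching to a running PID (stop it once a delay has been caught):

# tusc -f -T "" -o /tmp/inetd.tusc `ps -e | awk '$NF == "inetd" {print $1}'`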

-Sri
You may be disappointed if you fail, but you are doomed if you don't try
Steven E. Protter
Exalted Contributor

Re: Still that telnet latency

Install this:

http://hpux.connect.org.uk/hppd/hpux/Gtk/Applications/ethereal-0.9.15/

In X run ethereal and collect all the network packet data on your box as this problem happens.

Then go over what you get.

You will find clues.

You may need a network guy to go over the data.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Seth Parker
Trusted Contributor

Re: Still that telnet latency

Ralph,

Do you use the -TCP_DELAY option in inetd.conf? If not, give it a try. It helped in my case after we first converted to 11.0. Regardless, it couldn't hurt if you haven't tried it.

Here's a link to a nice thread:
http://forums1.itrc.hp.com/service/forums/parseCurl.do?CURL=%2Fcm%2FQuestionAnswer%2F1%2C%2C0x8f8c663ce855d511abcd0090277a778c%2C00.html&admit=716493758+1075177353267+28353475

Good luck!
Seth
Dietmar Konermann
Honored Contributor

Re: Still that telnet latency

Hi, Ralph!

I'm not a networking guru... consequently tusc would be the tool I would start with. :)

I assume that the problem is also reproducible on another port than telnet's 23. In that case I would configure an alternate telnet port for troubleshooting... makes things less dangerous for production.

/etc/services:
telnetx 10023/tcp

/etc/inetd.conf:
telnetx stream tcp6 nowait root /usr/lbin/telnetdwrapper telnetd

/usr/lbin/telnetdwrapper:
#!/sbin/sh
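# -f follows forks, -T "" adds timestamps, -o writes one trace file per connection ($$)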
exec /usr/contrib/bin/tusc -faken -T "" -o /tmp/telnetd.$$ /usr/lbin/telnetd "$@"

This gives you a quite comprehensive tusc trace for each connection to port 10023/tcp. Running your Perl script against this port could reveal some interesting details. Be sure to have enough space in the target directory.

Best regards...
Dietmar.
"Logic is the beginning of wisdom; not the end." -- Spock (Star Trek VI: The Undiscovered Country)
Elmar P. Kolkman
Honored Contributor

Re: Still that telnet latency

One more thing: telnet is not the only process using ptys. Perhaps monitoring other established connections might give more insight.
Every problem has at least one solution. Only some solutions are harder to find.
Jakes Louw
Trusted Contributor

Re: Still that telnet latency

Just a flying try here:

check out the failed logins using "lastb":
perhaps there's some person trying to connect to the telnet port from another server. These will show up in the lastb output as garbage userids.

Trying is the first step to failure - Homer Simpson
Ralph Grothe
Honored Contributor

Re: Still that telnet latency

Hello friends,

many thanks for your valuable suggestions.

I do apologize for my belated response.
But I was totally occupied the whole day yesterday, when we were almost struck by disaster: a one-in-a-million incident occurred in which both controllers of an HP Virtual Raid System failed simultaneously.
One bit the dust completely while the other wasn't usable because of a failed DIMM in its cache.
It may well be that the DIMM failure had already gone unnoticed earlier.
Because this RAID is an isolated storage solution outside our SAN infrastructure and was imposed by the customer, I guess we've been too careless in documenting this RAID's layout.
To make a sad long story short, the HP SE who was called was fortunate in recovering the RAID from the VFB of the controller's firmware after hours of fruitless attempts.

Sri and Dietmar,

thanks for pointing me at tusc.
It sounds to me like some sort of syscall tracer.
I haven't played with it before, and haven't yet googled for its homepage.
I guess it is open source?
At least I cannot find it installed on the affected box.
I had already done something similar long ago with an earlier version of my script, where I logged the PIDs of the telnet sessions on the server while running inetd in debug/verbose mode (i.e. the -b toggle).
But this wasn't really instructive, since the logging by the spawning inetd only started after the latency period (by comparison of the timestamps) and gave no evidence for the cause of the blocks.
I will surely get and install tusc and see what it reveals.

Steven,

I had hoped to avoid sniffing, but it's probably the only remedy.
I understand ethereal is cosier to use than nettl and netfmt.
So far I have only made acquaintance with Solaris' snoop, which I think is similar to tcpdump.
But it couldn't hurt to install the libpcap anyway.

Elmar,

I know, but I don't suspect greater usage of ptys from other services.
But I could also log the system tables for possible hits of boundaries such as files, procs, ttys etc.

Jakes,

last didn't reveal such logins.


Will be back soon, time permitting

Ralph
Madness, thy name is system administration
Sridhar Bhaskarla
Honored Contributor

Re: Still that telnet latency

Ralph,

You can get it from the porting center.

http://hpux.cs.utah.edu/hppd/hpux/Sysadmin/tusc-7.5/

Yes. It's like truss on Sun, for following the system calls. To use it with inetd, you will need to follow the forks, as telnetd is forked by inetd.

It's a good tool to have, like lsof.

-Sri
You may be disappointed if you fail, but you are doomed if you don't try
Ralph Grothe
Honored Contributor

Re: Still that telnet latency

Hello Dietmar,

thank you for your smashing hint to set up another telnetd on a free port and have it exposed to tusc's syscall tracking.

I downloaded a precompiled binary from the HP Porters and installed it.

Then I set up a 2nd telnetd/inetd listening socket almost exactly as you outlined it.
I only added the triple -c switch to tusc's invocation in order to have the elapsed CPU time per syscall logged as well.

tusc itself got installed here

# uname -srv
HP-UX B.11.00 U
# ll /usr/local/bin/tusc
-rwx--x--x 1 root sys 819200 May 20 2003 /usr/local/bin/tusc


I registered the suggested port 10023 under the name telnetdb, as it hadn't been used by another service yet (and because I liked the mnemonic link to the original telnet port, a "donkey's bridge" as we Germans call such memory aids ;-))

# grep telnet /etc/services
telnet 23/tcp # Virtual Terminal Protocol
# debug wrapper for telnet analysis
telnetdb 10023/tcp

# grep telnet /etc/inetd.conf
telnet stream tcp nowait root /usr/lbin/telnetd telnetd -b /etc/issue
telnetdb stream tcp nowait root /usr/local/sbin/tusctelnet telnetd


As you can see, I called my wrapper tusctelnet (deviating from your naming), placed it in this path, and made it executable for root

# ll /usr/local/sbin/tusctelnet
-rwxr--r-- 1 root sys 114 Jan 30 00:32 /usr/local/sbin/tusctelnet


The script's content is almost identical to yours.
I only changed the default timestamp format (a strftime() string) to make it a bit more readable.


# cat /usr/local/sbin/tusctelnet
#!/sbin/sh
exec /usr/local/bin/tusc -fkaenccc -T "%Y/%m/%d %H:%M:%S" -o /tmp/tusctelnet.$$ /usr/lbin/telnetd "$@"


I then asked inetd to reread its configuration by issuing

/usr/sbin/inetd -c

and checked in syslog.log that it recognised the new telnetdb.


But all tests to connect to the new socket on localhost (so far only manual ones, to see if it works) fail with "Login incorrect":


# telnet localhost 10023
Trying...
Connected to localhost.
Escape character is '^]'.
Telnet TERMINAL-SPEED option ON
Local flow control on

HP-UX saturn B.11.00 U 9000/800 (tm)

login: topx
Login incorrect
login: topx
Login incorrect
login: topx
Login incorrect
Connection closed by foreign host.
# echo $?
1

# netstat -anfinet|grep 10023
tcp 0 0 127.0.0.1.10023 127.0.0.1.61879 TIME_WAIT
tcp 0 0 *.10023 *.* LISTEN


At first I suspected the exec call in the wrapper script, since exec replaces the current process with the command passed to it, so I tried removing it.
But of course the removal didn't change a thing.

Could you give me a clue what's going wrong?

The inspection of the newly created tusc logfile (as set through -o ...) wasn't too revealing either.
Because of the bulk of output produced I fear losing the overview; there will be a lot to wade through.

Besides, a few other silly questions.
Why should the new telnetdb be started as a root process?
Since I want to bind it to a port well beyond 1024, there should be no need for that.
Or does the tusc tracker/debugger require this?

And then, shouldn't inetd be tusc'ed instead of telnetd, despite the even bigger blow-up of logging output?

Still many questions.
Hope you'd find the time to drop a quick line.

Ralph
Madness, thy name is system administration
Dietmar Konermann
Honored Contributor

Re: Still that telnet latency

Hi, Ralph!

Oops... I'm sorry. You are right, this wrapped telnetd does not work. It looks like some security feature; you see the same if you use a wrapper without "exec"... then the login fails even without tusc.

And running tusc on inetd could really be too verbose. :) Didn't you say the delay happens _before_ the login: prompt appears? Then why not use the trace data even if the login fails?

BTW, maybe this works as non-root also... I didn't test it.

Best regards...
Dietmar.
"Logic is the beginning of wisdom; not the end." -- Spock (Star Trek VI: The Undiscovered Country)
Dietmar Konermann
Honored Contributor

Re: Still that telnet latency

Ralph,

I see you've just logged a case here in Ratingen. I will point the owning engineer to this thread. :)

Best regards...
Dietmar.
"Logic is the beginning of wisdom; not the end." -- Spock (Star Trek VI: The Undiscovered Country)
Ralph Grothe
Honored Contributor

Re: Still that telnet latency

Dietmar,

I've just talked to your colleague, and he seems to know the cause.

The inetd is running with the "-l" option and is serving a lot of clients of unknown origin for which it cannot reverse-resolve the FQDN from the IP.

So, because the customer requires inetd's session logging, what we need is some patch that supplies inetd with something like netstat's "-n" switch, to keep it from attempting reverse resolution.
Madness, thy name is system administration
Jochen Heuer
Respected Contributor
Solution

Re: Still that telnet latency

Hi there,

I am the engineer who got the call from Mr. Grothe. To me the problem seems to be inetd hanging while trying to do the reverse lookup of the incoming IP address (inetd -l -> every connection is logged with the resolved name). Since inetd is single-threaded, local connections (whose names do resolve quickly) can be delayed too.

The fix is to install a current inetd patch (>= PHNE_28312) and use the new '-s' option as well. The connection is then still logged, but only with the IP address.
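If inetd's startup flags are kept in /etc/rc.config.d/netdaemons, as is usual, the change would look like this (variable name from memory, please verify on your system):

# /etc/rc.config.d/netdaemons
export INETD_ARGS="-l -s"

# restart inetd so it picks up the new flags
/sbin/init.d/inetd stop
/sbin/init.d/inetd start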

Best regards,

Jochen
Well, yeah ... I suppose there's no point in getting greedy, is there?
Jochen Heuer
Respected Contributor

Re: Still that telnet latency

Hi Ralph,

I did not see your post but that sums it up exactly :)

Best regards,

Jochen
Well, yeah ... I suppose there's no point in getting greedy, is there?
Ralph Grothe
Honored Contributor

Re: Still that telnet latency

Since the patch's README, especially this section


2. JAGad01042 /SR 8606131892
inetd -l causes inbound connection delay if the hostname
lookup required for logging is slow.

Resolution:
A new option "-s" is provided to suppress the hostname in
the connection log message.


sounds very promising to me, I'm pretty confident that this damned patch will solve our problem.
Btw, it's interesting to note that the README bears this header as the date of posting/release of the patch


# grep -i post\ date PHNE_28312.text
Post Date: 03/04/30


So it's probably no wonder that no one was able to help me when I first submitted a support call to HP in this matter.
I'm almost tempted to suspect that I'm one of the many customers who finally made HP come up with this patch (possibly the drop that made the barrel overflow, as a German saying goes ;-))

But checking the patch's list of prerequisite patches (the beloved PHKL rebooters) I had to conclude that we still lack a few.
This calls for another downtime.

As soon as I get the customer's placet I will pop back into this thread to assign Jochen his well-deserved rabbits.
Madness, thy name is system administration
Ralph Grothe
Honored Contributor

Re: Still that telnet latency

Jochen,

this morning I checked the logfile of my telnet checking daemon again.
The longest latency has been 3.75 secs since its restart, after bringing inetd into hostname-ignore mode through SIGPIPE.

I'm now pretty confident that I can cease the checks.

Many, many thanks for supplying the patch :-)
Madness, thy name is system administration