Re: Question about tcp_keepalive_interval (user sessions dropping)

JDR45 · 06-21-2016

Hello,

We use Hummingbird Host Explorer to connect to an HP RP3410 running HP-UX 11.11.

User sessions are dropping. Sometimes inactive sessions, sometimes active sessions.

I'm wondering if the tcp_keepalive_interval setting would help keep these sessions from dropping?

The default value is 7,200,000 milliseconds, equal to 120 minutes.

Our server is set to 1,800,000 milliseconds, or 30 minutes.
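
For anyone checking the arithmetic: the values are in milliseconds, so the conversion can be sketched in a few lines of Python. The function name here is made up for illustration and is not part of any HP-UX tool.

```python
# Convert tcp_keepalive_interval values (milliseconds) to minutes.
MS_PER_MINUTE = 60 * 1000

def ms_to_minutes(value_ms: int) -> float:
    """Return the number of minutes represented by a millisecond value."""
    return value_ms / MS_PER_MINUTE

# The two values mentioned in this post.
for label, value_ms in [("default", 7_200_000), ("our server", 1_800_000)]:
    print(f"{label}: {value_ms} ms = {ms_to_minutes(value_ms):.0f} minutes")
```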

The part I don't get, and this could be a silly question, is whether I want this value set high or low to make sure the server and the users on Host Explorer stay connected all day and don't get dropped.

Thanks much!

Bill Hassell · 06-21-2016

Need a bit more information. I assume that this program is a high-end terminal emulator using telnet or ssh, correct? How are the users interacting with the system: a simple shell running various commands like ps, bdf or vi? Or are they running some menu program? Did the sysadmin set a shell timeout (hint: echo $TMOUT) for automatic logout of idle sessions? Are there error messages in syslog pointing to these disconnects? Does dmesg report anything about networking?

Bill Hassell, sysadmin

JDR45 · 06-22-2016

OK, Hummingbird Host Explorer is a terminal emulator that allows the users at the remote location to connect to the unix server using telnet.

The remote users see a custom menu system that lets them run some business software. They don't get a command prompt or run unix commands. The users log on in the morning and stay on the system all day.

I can log on as root from work and from home using the same Hummingbird software and stay on all day with no dropped connections. My laptop even went into sleep mode once and I still stayed connected to the server.

TMOUT is set to zero.

Syslog is clear, dmesg is clear. I turned on nettl logging / netfmt and that is clear. There isn't a single clue on the server itself pointing to any problems. Everything on the server looks happy. I have a feeling this might turn out to be more of a networking problem at the remote location instead of something wrong with the server.

Just thought I would try tinkering with tcp_keepalive_interval. I just can't tell if a big or small number (say 7,200,000 vs 1,800,000) would help?

Thanks.

JDR45 · 06-22-2016

I found this bit of info:

"By default keepalive is set to 7,200,000. This means that every two hours the server tests the idle TCP connection by pinging the client. If the server gets no response from the client the keepalive terminates the idle connection."

Based on that I changed the value from 1,800,000 to 7,200,000. That way it will check every two hours instead of every 30 minutes, since we don't want idle connections dropped. Will see if that helps.
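
To put rough numbers on that reasoning, here is a simplified sketch. It assumes exactly one check each time the interval elapses on a completely idle connection, and it ignores retries and anything that devices between the server and the client might do. The function name and the 8-hour idle period are only illustrative.

```python
def checks_while_idle(idle_ms: int, keepalive_interval_ms: int) -> int:
    """Count keepalive checks during one unbroken idle period.

    Simplified model: one check every time the full interval passes
    with no traffic on the connection.
    """
    return idle_ms // keepalive_interval_ms

EIGHT_HOURS_MS = 8 * 60 * 60 * 1000  # hypothetical fully idle workday

for interval_ms in (1_800_000, 7_200_000):
    count = checks_while_idle(EIGHT_HOURS_MS, interval_ms)
    print(f"interval {interval_ms} ms -> {count} checks in 8 idle hours")
```

Under this model the 30-minute setting checks an idle session four times as often as the 2-hour setting. Whether more frequent checks help or hurt depends on why the sessions drop, which the numbers alone do not say.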

Bill Hassell · 06-22-2016

Using ping as a connection tester is crude at best, especially if a single missed ping counts as a failure. Ping, unlike a TCP connection, ignores dropped packets; that is, it will not retry. So a single missed ping is a bad test for connectivity, and since your problem connections are remote, occasional dropped packets may be completely normal.

So a very long tcp_keepalive_interval would be recommended for wide area connections.

Bill Hassell, sysadmin

donna hofmeister · 06-22-2016

Please run "netstat -s", wait 5 minutes, then run it again. Please post the results here.

JDR45 · 06-22-2016

Here goes, netstat -s

tcp:
4447178 packets sent
3213910 data packets (786369784 bytes)
37563 data packets (4911575 bytes) retransmitted
1233466 ack-only packets (1158742 delayed)
0 URG only packets
0 window probe packets
2 window update packets
442587 control packets
4035776 packets received
2595631 acks (for 788833202 bytes)
1610 duplicate acks
0 acks for unsent data
2272119 packets (231529981 bytes) received in-sequence
1 completely duplicate packet (119 bytes)
87 packets with some dup, data (32057 bytes duped)
8151 out of order packets (5139464 bytes)
27 packets (3284241333 bytes) of data after window
0 window probes
17108 window update packets
18 packets received after close
7 segments discarded for bad checksum
0 bad TCP segments dropped due to state change
59830 connection requests
17612 connection accepts
77442 connections established (including accepts)
123247 connections closed (including 45820 drops)
44865 embryonic connections dropped
2428882 segments updated rtt (of 2428882 attempts)
211803 retransmit timeouts
44754 connections dropped by rexmit timeout
0 persist timeouts
158864 keepalive timeouts
152585 keepalive probes sent
87 connections dropped by keepalive
0 connect requests dropped due to full queue
1898 connect requests dropped due to no listener
0 suspect connect requests dropped due to aging
0 suspect connect requests dropped due to rate

udp:
0 incomplete headers
0 bad checksums
0 socket overflows

ip:
4307267 total packets received
0 bad IP headers
0 fragments received
0 fragments dropped (dup or out of space)
0 fragments dropped after timeout
0 packets forwarded
0 packets not forwardable

icmp:
79 calls to generate an ICMP error message
0 ICMP messages dropped
Output histogram:
echo reply: 78
destination unreachable: 1
source quench: 0
routing redirect: 0
echo: 0
time exceeded: 0
parameter problem: 0
time stamp: 0
time stamp reply: 0
address mask request: 0
address mask reply: 0
0 bad ICMP messages
Input histogram:
echo reply: 55753
destination unreachable: 12
source quench: 0
routing redirect: 0
echo: 78
time exceeded: 0
parameter problem: 0
time stamp request: 0
time stamp reply: 0
address mask request: 0
address mask reply: 0
78 responses sent

igmp:
0 messages received
0 messages received with too few bytes
0 messages received with bad checksum
0 membership queries received
0 membership queries received with incorrect fields(s)
0 membership reports received
0 membership reports received with incorrect field(s)
0 membership reports received for groups to which this host belongs
0 membership reports sent

JDR45 · 06-22-2016

And five minutes later-

tcp:
4448640 packets sent
3214860 data packets (786533237 bytes)
37569 data packets (4911581 bytes) retransmitted
1233978 ack-only packets (1159250 delayed)
0 URG only packets
0 window probe packets
2 window update packets
442623 control packets
4036936 packets received
2596376 acks (for 788996672 bytes)
1610 duplicate acks
0 acks for unsent data
2272784 packets (231531798 bytes) received in-sequence
1 completely duplicate packet (119 bytes)
87 packets with some dup, data (32057 bytes duped)
8151 out of order packets (5139464 bytes)
27 packets (3284241333 bytes) of data after window
0 window probes
17111 window update packets
18 packets received after close
7 segments discarded for bad checksum
0 bad TCP segments dropped due to state change
59833 connection requests
17615 connection accepts
77448 connections established (including accepts)
123256 connections closed (including 45822 drops)
44867 embryonic connections dropped
2429609 segments updated rtt (of 2429609 attempts)
211821 retransmit timeouts
44756 connections dropped by rexmit timeout
0 persist timeouts
158879 keepalive timeouts
152600 keepalive probes sent
87 connections dropped by keepalive
0 connect requests dropped due to full queue
1898 connect requests dropped due to no listener
0 suspect connect requests dropped due to aging
0 suspect connect requests dropped due to rate

udp:
0 incomplete headers
0 bad checksums
0 socket overflows

ip:
4308442 total packets received
0 bad IP headers
0 fragments received
0 fragments dropped (dup or out of space)
0 fragments dropped after timeout
0 packets forwarded
0 packets not forwardable

icmp:
79 calls to generate an ICMP error message
0 ICMP messages dropped
Output histogram:
echo reply: 78
destination unreachable: 1
source quench: 0
routing redirect: 0
echo: 0
time exceeded: 0
parameter problem: 0
time stamp: 0
time stamp reply: 0
address mask request: 0
address mask reply: 0
0 bad ICMP messages
Input histogram:
echo reply: 55757
destination unreachable: 12
source quench: 0
routing redirect: 0
echo: 78
time exceeded: 0
parameter problem: 0
time stamp request: 0
time stamp reply: 0
address mask request: 0
address mask reply: 0
78 responses sent

igmp:
0 messages received
0 messages received with too few bytes
0 messages received with bad checksum
0 membership queries received
0 membership queries received with incorrect fields(s)
0 membership reports received
0 membership reports received with incorrect field(s)
0 membership reports received for groups to which this host belongs
0 membership reports sent

JDR45 · 06-22-2016

Thanks for the info on netstat -s, I've never tried that particular version of that command. :)

donna hofmeister · 06-23-2016

Thanks!

The thing with netstat results is that looking at them once is pretty meaningless. Running it the way I asked, however, begins to give a picture of what's happening on the system. When you subtract the first set of numbers from the second (later) set, you can see the delta, i.e. the change.

The other piece of the puzzle is this white paper: http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=c02020743&lang=en-us&cc=us (see Appendix A in particular).

Here are the TCP deltas:

tcp:
1462 packets sent
950 data packets (163453 bytes)
6 data packets (6 bytes) retransmitted
512 ack-only packets (508 delayed)
0 URG only packets
0 window probe packets
0 window update packets
36 control packets
1160 packets received
745 acks (for 163470 bytes)
0 duplicate acks
0 acks for unsent data
665 packets (1817 bytes) received in-sequence
0 completely duplicate packet (0 bytes)
0 packets with some dup, data (0 bytes duped)
0 out of order packets (0 bytes)
0 packets (0 bytes) of data after window
0 window probes
3 window update packets
0 packets received after close
0 segments discarded for bad checksum
0 bad TCP segments dropped due to state change
3 connection requests
3 connection accepts
6 connections established (including accepts)
9 connections closed (including 2 drops)
2 embryonic connections dropped
727 segments updated rtt (of 727 attempts)
18 retransmit timeouts
2 connections dropped by rexmit timeout
0 persist timeouts
15 keepalive timeouts
15 keepalive probes sent
0 connections dropped by keepalive
0 connect requests dropped due to full queue
0 connect requests dropped due to no listener
0 suspect connect requests dropped due to aging
0 suspect connect requests dropped due to rate

At the time you ran netstat, it appears there wasn't a lot happening (your numbers above are kinda small), so I don't think there are any "Ah! Ha!" moments. Having said that, you can see that there might be some "interesting" numbers. In particular, 2 sessions were dropped. Why? It could be that the system could no longer reach <whatever> and so closed the connection. Why was <whatever> no longer there? Someone could have closed their laptop and left the building. It could also be that someone thought their session was hung (when it was really just unresponsive) and ungracefully exited ("x-ing" out of Hummingbird).

My advice is:

1. Find a time when the system has more users on it.
2. Run netstat -s.
3. <wait> (how long isn't really important, but at least 5 minutes).
4. Run netstat -s again.

Do <something> to compare the numbers (drop them into Excel? diff?) and compare the results to the discussion in the above white paper.
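
One possible <something> is a short Python script. This is only a sketch: it assumes both snapshots contain the same counter lines in the same order, and the function names are made up for illustration. It works line by line instead of keying on the wording, because some wording repeats within a section (for example "window update packets" appears in both the sent and received groups).

```python
import re

_NUM = re.compile(r"\d+")

def parse_counters(text):
    """Return (line, numbers) for every line that contains at least one integer."""
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        numbers = [int(n) for n in _NUM.findall(line)]
        if numbers:  # section headers such as "tcp:" carry no numbers and are skipped
            rows.append((line, numbers))
    return rows

def diff_snapshots(before, after):
    """Subtract the 'before' counters from the 'after' counters, line by line."""
    old, new = parse_counters(before), parse_counters(after)
    if len(old) != len(new):
        raise ValueError("snapshots do not have the same number of counter lines")
    result = []
    for (_, a), (line, b) in zip(old, new):
        if len(a) != len(b):
            raise ValueError(f"line shape changed: {line!r}")
        deltas = iter(y - x for x, y in zip(a, b))
        # Rewrite the later line with each number replaced by its delta.
        result.append(_NUM.sub(lambda m: str(next(deltas)), line))
    return result

# First three tcp lines of the two snapshots posted in this thread.
BEFORE = """tcp:
4447178 packets sent
3213910 data packets (786369784 bytes)
37563 data packets (4911575 bytes) retransmitted
"""
AFTER = """tcp:
4448640 packets sent
3214860 data packets (786533237 bytes)
37569 data packets (4911581 bytes) retransmitted
"""

for line in diff_snapshots(BEFORE, AFTER):
    print(line)
```

In practice you would save each netstat -s run to a file and pass the two file contents to diff_snapshots instead of the inline samples.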

Finally, keep in mind that on a really busy network the issue is NOT your system; it's the network itself. Think of trying to get onto the interstate during rush hour: that's exactly what's happening with your packets. Armed with a netstat analysis, it may be possible to go to your networking team and say "Hey!" (as wonderful as networking teams can be, they do seem to have a reputation for saying "there are no networking problems" :-)

JDR45 · 06-24-2016

Thanks for the links to the white papers. Will give them a read. This server barely gets any use (18 users max), which is another reason why I don't get why the remote user sessions drop. It isn't from a heavy load.

Will keep track of netstat -s and keep an eye on the interesting numbers you mentioned.

Here's a fresh one from today, June 24-

------------------------------------------------------------------------------------------------------------------------

tcp:
4546858 packets sent
3285808 data packets (807571499 bytes)
39086 data packets (5025188 bytes) retransmitted
1261252 ack-only packets (1184862 delayed)
0 URG only packets
0 window probe packets
2 window update packets
452435 control packets
4127369 packets received
2651991 acks (for 810062267 bytes)
1651 duplicate acks
0 acks for unsent data
2323447 packets (237362168 bytes) received in-sequence
1 completely duplicate packet (119 bytes)
89 packets with some dup, data (33129 bytes duped)
8339 out of order packets (5251771 bytes)
27 packets (3284241333 bytes) of data after window
0 window probes
17433 window update packets
18 packets received after close
7 segments discarded for bad checksum
0 bad TCP segments dropped due to state change
60942 connection requests
17996 connection accepts
78938 connections established (including accepts)
125648 connections closed (including 46724 drops)
45686 embryonic connections dropped
2480824 segments updated rtt (of 2480824 attempts)
216572 retransmit timeouts
45582 connections dropped by rexmit timeout
0 persist timeouts
162795 keepalive timeouts
156413 keepalive probes sent
149 connections dropped by keepalive
0 connect requests dropped due to full queue
2159 connect requests dropped due to no listener
0 suspect connect requests dropped due to aging
0 suspect connect requests dropped due to rate

udp:
0 incomplete headers
0 bad checksums
0 socket overflows

ip:
4405083 total packets received
0 bad IP headers
0 fragments received
0 fragments dropped (dup or out of space)
0 fragments dropped after timeout
0 packets forwarded
0 packets not forwardable

icmp:
79 calls to generate an ICMP error message
0 ICMP messages dropped
Output histogram:
echo reply: 78
destination unreachable: 1
source quench: 0
routing redirect: 0
echo: 0
time exceeded: 0
parameter problem: 0
time stamp: 0
time stamp reply: 0
address mask request: 0
address mask reply: 0
0 bad ICMP messages
Input histogram:
echo reply: 56777
destination unreachable: 14
source quench: 0
routing redirect: 0
echo: 78
time exceeded: 0
parameter problem: 0
time stamp request: 0
time stamp reply: 0
address mask request: 0
address mask reply: 0
78 responses sent

igmp:
0 messages received
0 messages received with too few bytes
0 messages received with bad checksum
0 membership queries received
0 membership queries received with incorrect fields(s)
0 membership reports received
0 membership reports received with incorrect field(s)
0 membership reports received for groups to which this host belongs
0 membership reports sent

donna hofmeister · 06-24-2016

I don't like that you have TCP checksum errors, but the number is so low that it's likely the "frankengram" scenario.

Your retransmit rate is about 4%, which may be enough for this application to have issues. I still urge you to raise this with your networking team. Perhaps they can do some tuning on their side.
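
The post does not say which counters went into the "about 4%" figure, so the sketch below simply computes two plausible ratios from the June 24 netstat -s output. Which ratio (if either) matches the white paper's definition is an assumption, not something confirmed in this thread.

```python
# Counters copied from the June 24 "netstat -s" output in this thread.
packets_sent = 4_546_858
data_packets = 3_285_808
data_packets_retransmitted = 39_086
retransmit_timeouts = 216_572

def percent(part: int, whole: int) -> float:
    """Express part as a percentage of whole."""
    return 100.0 * part / whole

# Candidate 1: share of data packets that had to be sent again.
retransmitted_share = percent(data_packets_retransmitted, data_packets)

# Candidate 2: retransmit timeouts relative to all packets sent.
timeout_share = percent(retransmit_timeouts, packets_sent)

print(f"retransmitted data packets / data packets: {retransmitted_share:.2f}%")
print(f"retransmit timeouts / packets sent:        {timeout_share:.2f}%")
```

Since these are totals since boot, the same ratios computed on deltas between two snapshots would say more about what is happening right now.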

JDR45 · 06-27-2016

I think the problem is on the network at the remote location, the user's PC, or something with the Hummingbird Host Explorer software. I just don't see anything on the Unix server I can fix to help with this.

Learned a lot about netstat and ndd though, so thanks everyone!

akio_kabutogi · 06-28-2016

https://confluence.eits.uga.edu/display/HDSH/Hummingbird+Issues

describes, under its "Keep Alive Signal" section, how to set up a keepalive signal sent from the client side. I suppose this would be the perfect solution, though tcp_keepalive_interval can be set up to 10*24*3600000 (10 days in milliseconds).

If this is configured on the Hummingbird side, then you'll never get the session timed out.
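
As a general illustration of the client end asking for keepalives, here is a minimal Python sketch using the standard SO_KEEPALIVE socket option. This is not how Hummingbird is configured, and its "Keep Alive Signal" setting may work quite differently; the helper name is made up, and the socket is never connected anywhere.

```python
import socket

def make_keepalive_socket() -> socket.socket:
    """Create an unconnected TCP socket with keepalive probing switched on."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the operating system to probe the peer when the connection is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock

sock = make_keepalive_socket()
enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
print("keepalive enabled:", bool(enabled))
sock.close()
```

How often the probes go out is then decided by the client operating system's own keepalive timers, not by the server's tcp_keepalive_interval.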
