Re: TCPIP services do not always react

Willem Grooters · ‎11-26-2003

Customer problem....
Two VMS machines, not clustered. Both run VMS 7.1-2; machine A has TCPIP 5.0A ECO1, machine B TCPIP 5.1.
Machine A services a number of external applications. Actually, it's the same command procedure, run by the same user for each and accessing the same data, but on behalf of different systems (Unix, Windows). A third instance services requests from an application on machine B. These three services are (of course) on different ports: 3011 (S1), 3012 (S2) and 3013 (S3), and have different process names; each has a limit of 15.
The program started behind it keeps the session open and handles each subsequent request.
S1 has been activated a total of 12 times, S2 the full 15. So I have 12 processes named S1_ and 15 named S2_, all active for days.
However, S3 can only be invoked 2-3 times; the next attempt will not even produce a logfile. The process (on system B) that tries to access this service gets some error on return. For what reason, I cannot tell (at least for now, since the program needs to be altered for that). But given the error and the fact that the service on system A does not produce ANY logging, I conclude that the service isn't even started.
The question is why. Since it apparently DOES work (S1 and S2 DO run, and S3 does for some sessions at least), there must be something that limits the ability to open extra channels. But where?
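
Since the client program currently hides the actual error, a small stand-alone probe can show whether a connect attempt is refused, times out, or fails some other way. The following is only a sketch in Python (not the application's own code); the function name `probe` and its wording of the outcomes are made up for illustration:

```python
import errno
import socket


def probe(host, port, timeout=5.0):
    """Try a single TCP connect and describe the outcome in a few words."""
    try:
        # create_connection resolves the host and performs the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No answer to the connect request within the timeout.
        return "timed out"
    except ConnectionRefusedError:
        # The remote end actively rejected the connection.
        return "refused"
    except OSError as exc:
        # Anything else: report the symbolic errno name if there is one.
        return "error: " + errno.errorcode.get(exc.errno, str(exc))
```

Run from the requesting node against port 3013 of the serving node, a "timed out" result would point at a request that is never answered (or never sent), while "refused" would point at the listener itself.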

Willem Grooters
OpenVMS Developer & System Manager

Lokesh_2 · ‎11-26-2003

Hi ,

Check the MAXPROCESSCNT sysgen parameter. Maybe the system has reached that value.

Thanks & regards,
Lokesh

What would you do with your life if you knew you could not fail?

Lokesh_2 · ‎11-26-2003

Or check the maximum number of UCX device sockets:

$ ucx sho comm

hope this helps,
Lokesh

What would you do with your life if you knew you could not fail?

Antoniov. · ‎11-26-2003

Hi Willem,
Sorry I can't help you directly; I can only encourage you and try to give you a clue. Some services on VMS have a limited number of connections; for example, if you type TCPIP SHOW SERVICE /FULL you can see Limit: nn, where nn is the maximum number of connections to that service (telnet, for instance). At one of my customers this value was 3, and the 4th PC could not log in, without any error in any log.
Your trouble sounds like this limitation; look at your service characteristics to find the limit value.
Bye
Antoniov

Antonio Maria Vigliotti

Willem Grooters · ‎11-26-2003

Lokesh:
MAXPROCESSCNT is big enough. It _could_ be the limit, but when I asked, it had just happened again: MAXPROCESSCNT is over 1200 and the number of processes at that moment was less than 600.
The number of sockets _could_ be a problem, just look at the attachment (most likely happening on NODE3 - the requestor). But how to increase it? I dug into the documentation but didn't get a clue...
Antonio:
/LIMIT is not the point. It happens at way below that number....

Anyone - I've been told by a colleague it could very well be a matter of buffer exhaustion. But again: I cannot find a clue on how to increase this.

Attached: some info from each node involved. I don't know which node on the cluster invokes the problem, I tend to suspect the sender....

Willem Grooters
OpenVMS Developer & System Manager

Lokesh_2 · ‎11-26-2003

Hi Willem,

I have just posted a new thread about the difference in the UCX SHO COMM command's output between older and newer versions of TCPIP. In the older version, the maximum, current and peak number of device sockets were displayed, whereas in the newer one they are not.

One way to count the active device sockets on the system is to count the BG devices. But the question is how to find the maximum number of device sockets in the newer version?

Best regards,
Lokesh

What would you do with your life if you knew you could not fail?

Willem Grooters · ‎11-27-2003

Some more info was given:

The service is only enabled on one node of the cluster (NodeB).

When the problem occurs, the sender (NodeC) hangs for some time. If the service on port 3013 is disabled and enabled again on NodeB, the problem is over - for some time. But after a few requests have been sent, the problem turns up again.
I have requested some more info and updated the document - again attached (plain text).

Idea: could it be that the limit exists on NodeC? Since the service on NodeB is not started at all (not even a message!) it could be possible the request was never sent?

Willem Grooters
OpenVMS Developer & System Manager

Antoniov. · ‎11-27-2003

Hello Willem,
as I said before, I'm not sure about the cause of your trouble.
The service on port 3013 is limited to 15 connections; if I understand correctly, you need fewer than 15 concurrent connections. If some connections are not closed properly, it may happen (after 15 connections) that your system hangs because NodeB has exhausted the resource on earlier connections that are still active (even if unused). I realize I am simplifying the problem a lot, but you can check this quickly by setting the service limit to 50 (for example): if the problem then shows up later (it will still happen), you could investigate why some connections stay alive.
Remember that if you change the service limit you must stop the service and restart it.
At the moment I have no other ideas.
Antoniov

Antonio Maria Vigliotti

Willem Grooters · ‎11-27-2003

Antonio,
LIMIT is not the problem. See attachment, I tried to explain in more detail.
But I appreciate your new thread, it gives me another request for information ;-)

Willem Grooters
OpenVMS Developer & System Manager

Willem Grooters · ‎11-27-2003

New info leaking in..
We set up a test machine with the same application environment. This machine has no problem at all, where NodeC hangs time after time. Even when a NodeC request is waiting to be connected, the test machine's request is served! The problem can NOT be replicated there.
It seems to go wrong intermittently. One time the request arrives on NodeB; a request issued just a few moments later will end in failure, but repeated, it _may_ succeed. There's no guarantee it will. We didn't find a pattern. It seems the request never leaves NodeC, since we don't see anything happen on NodeB - where we DO see that the test machine IS served (the number of active services is increased).

So we concluded so far:
* The problem is NOT on NodeB, otherwise there would be problems with other systems as well, and the testsystem has no problems.
* The problem is NOT the NIC on nodeB - for the same reason
* The problem is NOT the NIC on NodeC - for the same reason
Remains: some setting on NodeC.

We're open for suggestions WHAT to change....

I included ana/sys output from both nodes, and the current SYSGEN parameters on NodeC. BTW: The application uses an RDB database on that machine. For that reason, some parameters will have quite high values.

Willem Grooters
OpenVMS Developer & System Manager

labadie_1 · ‎11-27-2003

I do not know if you can afford to do that, but can you simply take a crash dump when you have the problem?

then you will have plenty of time to analyse the hang.

Regards

Gerard

Willem Grooters · ‎11-28-2003

Gerard,

I would suggest this if it were just a test system. But this is a production system running a database. Last resort, perhaps, and only if really unavoidable.

Willem Grooters
OpenVMS Developer & System Manager

Antoniov. · ‎11-28-2003

Hello Willem,
here are some clues to analyze.
a) Is TCP/IP installed properly?
TCPIP>sysconfig -s
You must see inet, socket and arp loaded and configured (on all hosts).
b) Do you have sufficient sockets?
TCPIP>sysconfig -q socket
somaxconn must be at least 1024
HP recommends a high value (even 65536) on servers (NodeA and NodeB). HP also recommends setting pmtu_enabled=0 on servers. You can read more details here: http://h71000.www7.hp.com/doc/73final/6631/6631pro_contents.html

I reread your attachment; on NodeB the out-of-order packets are 0.27%, while on NodeC the rate is 2.16%. Maybe the trouble is on NodeC?
On NodeC:
TCPIP>SH DEV
Look for the device used for the service request, then
TCPIP>SH DEV /FUL
Here you could find some insufficient value.
Can you repeat this on the server NodeB, too?

Bye
Antoniov

Antonio Maria Vigliotti

Willem Grooters · ‎11-30-2003

Antonio,
My guess is indeed that NodeC causes the problems. However, it's not the services that go wrong. NodeC issues the request so outgoing traffic seems to be the problem. It may depend on other TCPIP traffic (Telnet sessions...), so I've asked for some more details - when the application seems to hang.
(Alas, I have no direct access to that machine, I have to rely on others....)

Willem Grooters
OpenVMS Developer & System Manager

Willem Grooters · ‎12-03-2003

Found on NodeB that one counter is larger - see attachment.
What is "sobacklogdrops" - connections dropped due to time-out?

Willem Grooters
OpenVMS Developer & System Manager

Antoniov. · ‎12-03-2003

Hello Willem,
in the link I posted above, you can read:
[...]
Network performance can degrade if a client overfills a socket listen queue
with TCP SYN packets, thereby blocking other users from the queue. To
eliminate this problem, increase the value of the sominconn attribute to its
maximum value. If the system continues to drop SYN packets, decrease the
value of the tcp_keepinit attribute to 30 (15 seconds). Monitor the values of
the sobacklog_drops and somaxconn_drops attributes to determine whether the
system is dropping packets. (See Section 2.3.2 for more information about event
counters.)
You can modify the tcp_keepinit attribute without rebooting the system.
[...] Section 2.3.2:
The socket subsystem has three attributes that monitor socket listen queue
events:
* The sobacklog_hiwat attribute counts the maximum number of pending
requests to any server socket.
* The sobacklog_drops attribute counts the number of times the system
dropped a received SYN packet because the number of queued SYN_RCVD
connections for a socket equaled the socket's backlog limit.
* The somaxconn_drops attribute counts the number of times the system
dropped a received SYN packet because the number of queued SYN_RCVD
connections for the socket equaled the upper limit on the backlog length
(somaxconn attribute).
The initial value of these attributes is 0. Use the sysconfig -q socket command
to display the current attribute values. If the values show that the queues are
overflowing, you may need to increase the socket listen queue limit.
The value of the sominconn attribute should equal the value of the somaxconn
attribute. When these two attributes are equal, the value of somaxconn_drops
will have the same value as sobacklog_drops.
However, if the value of the sominconn attribute is 0 (the default), and if one
or more server applications uses an inadequate value for the backlog argument
to its listen system call, the value of sobacklog_drops may increase at a rate
that is faster than the rate at which the somaxconn_drops counter increases. If
this occurs, you may want to increase the value of the sominconn attribute.
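
To make the relation between these counters concrete, here is a toy model in Python. It is only a sketch of how I read the quoted text, not the real kernel logic: it assumes the effective listen backlog is the application's backlog argument raised to at least sominconn and capped at somaxconn, and that the server never accepts anything during the burst. The function names are invented for the illustration.

```python
def effective_backlog(requested, sominconn, somaxconn):
    """Clamp the listen() backlog between sominconn and somaxconn (assumed rule)."""
    return min(max(requested, sominconn), somaxconn)


def simulate_syn_burst(syn_count, requested, sominconn, somaxconn):
    """Feed syn_count SYNs to one listener that never accepts; return the counters."""
    limit = effective_backlog(requested, sominconn, somaxconn)
    queued = 0
    counters = {"sobacklog_hiwat": 0, "sobacklog_drops": 0, "somaxconn_drops": 0}
    for _ in range(syn_count):
        if queued >= limit:
            # Queue is at the socket's backlog limit: the SYN is dropped.
            counters["sobacklog_drops"] += 1
            if queued >= somaxconn:
                # The limit that was hit is also the system-wide upper limit.
                counters["somaxconn_drops"] += 1
        else:
            queued += 1
            counters["sobacklog_hiwat"] = max(counters["sobacklog_hiwat"], queued)
    return counters
```

With sominconn at 0 and an application backlog of 5, a burst of 10 SYNs gives 5 sobacklog_drops and no somaxconn_drops in this model; with sominconn equal to somaxconn, both drop counters move together, which matches the behaviour the documentation describes.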

H.T

Antonio Maria Vigliotti

Willem Grooters · ‎12-10-2003

I've asked for more details:
NodeB is a 4100 with 2 GB memory. I counted over 400 IP sessions.
NodeC is an ES40 with 2 GB memory and 124 IP sessions.
The test machine - functionally equal to NodeC - is some small, old Alpha system.
Whenever NodeC cannot connect (hangs), the very same request is repeatedly sent from this (relatively slow) test machine, and it succeeds time after time. This kind of proves there is something wrong on NodeC.

A sudden thought: could it be that the ES40 is far too fast compared to the 4100?

I have asked for tracing (TCPTRACE) on both nodes to see what traffic occurs. I will come back to this later.

Willem Grooters
OpenVMS Developer & System Manager

Willem Grooters · ‎12-16-2003

Now we tried TCPTRACE - default settings, on both NodeB and NodeC, and on testmachine.
NodeC had a problem with TCPTRACE, couldn't lock the pages in the working set. After /BUFFERS=50 (half the default) no data could be written.

Could it be a memory assignment problem - too many connections perhaps? That could explain why one request will succeed one time and fail another....

Willem Grooters
OpenVMS Developer & System Manager

Ian Miller. · ‎12-16-2003

Re the problem with TRACE - this suggested a thought: your process quotas in SYSUAF may be insufficient to run TRACE with the requested number of buffers, but on one system the PQL_M system parameters are raising the quotas to a level sufficient to allow TRACE to run. Perhaps a similar problem exists with the original application. Compare the PQL_M and PQL_D parameters on the systems. Check the actual quotas that the relevant processes are getting (not necessarily what you specify, due to the PQL_ parameters).
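
As a rough illustration of that comparison, sketched in Python: the PQL_M parameters act as minimums, so the quota a process actually gets is at least the PQL_M value even when SYSUAF specifies less. The helper names are made up, and the dictionaries are simply keyed by quota name (for example WSEXTENT for PQL_MWSEXTENT); any numbers fed in are whatever you read from SYSUAF and SYSGEN on each node.

```python
def effective_quotas(uaf_quotas, pql_minimums):
    """Raise each SYSUAF quota to the matching PQL_M minimum, if that is higher."""
    return {
        name: max(value, pql_minimums.get(name, 0))
        for name, value in uaf_quotas.items()
    }


def differing_quotas(node_a, node_b):
    """List the quota names whose effective values differ between two nodes."""
    return sorted(name for name in node_a.keys() & node_b.keys()
                  if node_a[name] != node_b[name])
```

Two nodes with identical SYSUAF records can therefore still hand out different quotas, and `differing_quotas` would list exactly the ones to look at.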

____________________
Purely Personal Opinion

Willem Grooters · ‎12-21-2003

A bit of an update
After consulting HP we found this:
The application on NodeC starts the communication with the right IP address: 10.21.0.12 (we can prove that!). However, the BG device that is then allocated says the remote system is 108.21.0.12. It won't find that machine - so the connection times out.
If we specify the node name, NodeB, all is running fine. Without a problem!
So my first idea was to suspect routing tables that contain the wrong information, but on second thought that couldn't be true, since specifying the node name would then show the same problem. So it's not the routing tables....

Final possibility: the module that initiates the connection is erring. It uses the socket interface. Still I don't get it. This module is used so very often, in so many applications, that I would expect it to have problems elsewhere. But this is the first (and so far only) place that we've had trouble with it.
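
One way to narrow this down is to log the exact bytes that go into the socket address just before the connect call and compare them with what the BG device reports. The sketch below is in Python and is not the module in question; `packed_ipv4` is a hypothetical helper name. It converts the dotted address to its four network-order bytes and refuses any string that does not convert back to itself.

```python
import socket


def packed_ipv4(dotted):
    """Return the four network-order bytes for a dotted IPv4 address string."""
    packed = socket.inet_aton(dotted)
    # Guard against strings that parse to a different address than written.
    if socket.inet_ntoa(packed) != dotted:
        raise ValueError(f"{dotted!r} parses as {socket.inet_ntoa(packed)!r}")
    return packed
```

For 10.21.0.12 the first byte must be 10 (hex 0a); if the buffer handed to connect already carries 108 (hex 6c) there, the address is being damaged before it reaches TCP/IP, not inside it.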

Willem Grooters
OpenVMS Developer & System Manager
