Re: unable to handle buffer overflow

wong seng guan · ‎08-10-2009

Basically we have 2 OpenVMS systems: one is the central server for our online DB and the other is the terminal-facing Integrity OpenVMS gateway server.
As we approach our peak transaction period, our online server slows down its communication with the gateway servers. This causes communication data to pile up in the TCP socket buffer at the gateway server end. When it hits the threshold, it shows "tcp_writeAst: write error 28" (buffer exceeded), and communication between the online server and the gateway server never recovers until we reset the TCP connection.
For now we have resorted to increasing the TCPIP tcp sendspace buffer from 61142 to 987,136 and maxbuf from 8192 to 32000. So far this has stopped the "write error 28" from happening.
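For readers who want to see what a send buffer setting looks like from the application side, here is a minimal sketch in Python. It is an illustration only: it uses the per-socket SO_SNDBUF option on whatever host runs it, which is an assumption and not the system-wide TCPIP sendspace setting described above, and the requested size of 262144 bytes is a hypothetical value.

```python
import socket

# Hypothetical per-socket request; the OS may round, double, or cap it.
REQUESTED_SNDBUF = 262144

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
default_sndbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)

# Ask for a larger send buffer on this one socket only.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, REQUESTED_SNDBUF)
actual_sndbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
sock.close()

print(f"default send buffer: {default_sndbuf} bytes")
print(f"after request:       {actual_sndbuf} bytes")
```

The value read back may differ from the value requested, so it is worth checking rather than assuming the request was honoured as written.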

We are worried that this will happen again as the number of terminal connections to the gateway increases: traffic volume goes up and our online server has more processing to do, so the end result is that the error will come back.

My question: why is the TCPIP stack on the Integrity server unable to handle this type of buffer overflow condition?
The TCPIP version on the Integrity server is 5.6 ECO2.
Please find attached the TCPIP show version output from the Integrity server for your reference.

Below are the servers' specifications

Online server
Alpha DS20
OpenVMS 7.3
TCPware

Gateway server
Integrity RX2660
OpenVMS 8.3-1H1
TCPIP ver 5.6 ECO2

Online----LAN----- gateway ---- router ---VPN ----- router --- terminals

Hope to hear from you soon.
Regards,

Ananth S · ‎08-10-2009

The online server may not be fast enough in reading the data sent by the gateway from its socket, so the send buffer may be filling up on the gateway. Looking at packet traces would help you determine whether this is the problem. Looking at the socket buffers on the online server may also give some clues.
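The effect described here can be reproduced on a single machine. The sketch below is an illustration under stated assumptions: it uses a local socketpair in Python as a stand-in for the gateway-to-online link, not two OpenVMS systems, and the names gateway, online and fill_until_blocked are hypothetical. The "online" end never reads, so the sender's buffer fills and further writes are refused.

```python
import socket

def fill_until_blocked(sender, chunk=b"x" * 4096, limit=64 * 1024 * 1024):
    """Write to a non-blocking socket until the kernel refuses more data.

    Returns the number of bytes accepted before the buffers filled.
    """
    sender.setblocking(False)
    total = 0
    while total < limit:
        try:
            total += sender.send(chunk)
        except BlockingIOError:
            break  # local send buffer and peer receive buffer are both full
    return total

# Local stand-in for the link (assumption: one host, one socketpair).
gateway, online = socket.socketpair()
accepted = fill_until_blocked(gateway)  # "online" never calls recv()
print(f"sender stalled after {accepted} bytes")
gateway.close()
online.close()
```

The byte count at which the sender stalls depends on the buffer sizes of the host running the sketch, which is why no particular figure is shown.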

wong seng guan · ‎08-10-2009

Online server----LAN----- gateway server ---- router ---VPN ----- router --- terminals

Robert Gezelter · ‎08-10-2009

wong seng guan,

With all due respect, there is a significant amount of information not yet present.

First, there are several possibilities here. One is that there is a problem in either of the TCP/IP stacks (please note that OpenVMS for Alpha 7.3 is very old, and that the version of TCPware is not included in the OP).
Second, the problem could be caused by a programming error in the database server, were it to stop processing requests from the gateway. Depending upon the transaction volume, the resulting backlog could conceivably create a scenario like what is being seen.

Is some event happening on the database server that is stopping processing of events? What is the time scale needed to create this problem (e.g., does the depth of the buffering hold 0.5 seconds of data or 50 seconds of data)?
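The time-scale question comes down to simple arithmetic: buffer size divided by the rate at which the backlog grows. The sketch below works it through in Python for the two sendspace values quoted earlier in the thread (61142 and 987,136 bytes). The send and drain rates are hypothetical placeholders, since no traffic figures have been posted.

```python
def seconds_until_full(buffer_bytes, send_rate, drain_rate):
    """Seconds before a buffer fills, or None if it never fills.

    Rates are in bytes per second.
    """
    backlog_rate = send_rate - drain_rate
    if backlog_rate <= 0:
        return None  # the reader keeps up, so the buffer never fills
    return buffer_bytes / backlog_rate

# Hypothetical rates; substitute measured values from a packet trace.
SEND_RATE = 50_000   # bytes/s offered by the gateway
DRAIN_RATE = 30_000  # bytes/s consumed by the online server

old_depth = seconds_until_full(61_142, SEND_RATE, DRAIN_RATE)
new_depth = seconds_until_full(987_136, SEND_RATE, DRAIN_RATE)
print(f"old sendspace holds {old_depth:.1f} s of backlog")
print(f"new sendspace holds {new_depth:.1f} s of backlog")
```

With these made-up rates the larger buffer only moves the failure from about 3 seconds to about 49 seconds into a sustained overload, which is the kind of figure the question is asking for.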

- Bob Gezelter, http://www.rlgsc.com

Hoff · ‎08-10-2009

All network protocols and all applications encounter bandwidth limits as they are scaled up.

When the network (or application) limit is reached, the network protocol (or the application) attempts back-pressure, or the protocol drops messages. Which of these occurs depends on how bad the overload problem is, and how the protocol is designed; trade-offs made by designers.

Details of the particular limits vary. Widely.

While going to larger buffers can sometimes help smooth over the handling of bursty traffic, going to larger buffers with excess traffic simply forestalls the inevitable failure.

There are various techniques available here, depending on what limit is arising. Resolution could involve telecom bandwidth improvements or techniques including data compression, faster hardware or sharding techniques or both, tuned application software, or, well, something else.

It might mean the database is overloaded, the database design needs changes, the database server is overloaded, the network is overloaded, the disks on the server are overloaded, the application is overloaded, the quotas on the application server processes are insufficient, the memory on the servers is insufficient, or, well, anything.

Or, yes, there could be bugs in OpenVMS or the IP stacks, or in the application code. (OpenVMS and IP bugs are far less likely than application bugs.)

The likely first priority is to identify the particular limit being encountered here. That's also going to be your job; we cannot do that without direct access to the servers and the network involved.

wong seng guan · ‎08-11-2009

Thank you, all gurus here.

We have 11 gateway servers and one online server, and processing did not stop when the problem occurred.

The problem mostly happens at peak hour, over about 2 to 3 minutes, and the buffer then stays jammed until we reset the link on both servers.

We have sufficient bandwidth from the telecom provider and no data compression on our network.
So far the online server is not overloaded, because processor and memory utilization were below 50%.

Gurus here,
What about the mechanism of HP TCPIP?
Is there a way for the stack to monitor the socket buffer when it reaches the quota limit?

In theory the gateway server should resume sending data to the online server once the socket buffers have drained below the maximum, but it only resumes sending after I reset both links.
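For comparison, this is roughly what "resume once the buffer drains" looks like when the sending application handles a full buffer as a temporary condition. It is a minimal Python sketch on a local socketpair, not the gateway application or the HP TCPIP stack, and the names send_with_backpressure and drain are hypothetical. The sender waits with select() until the socket is writable again instead of giving up on the first refused write.

```python
import select
import socket
import threading

def send_with_backpressure(sock, data, timeout=5.0):
    """Send all of data on a non-blocking socket.

    When the send buffer is full, wait until the socket is writable
    again. Returns how many times the sender had to wait.
    """
    sock.setblocking(False)
    view = memoryview(data)
    stalls = 0
    while view:
        try:
            sent = sock.send(view)
            view = view[sent:]
        except BlockingIOError:
            stalls += 1
            _, writable, _ = select.select([], [sock], [], timeout)
            if not writable:
                raise TimeoutError("peer did not drain within timeout")
    return stalls

def drain(sock, expected, counter):
    """Reader side: consume bytes until the expected total has arrived."""
    while counter[0] < expected:
        chunk = sock.recv(65536)
        if not chunk:
            break
        counter[0] += len(chunk)

gateway, online = socket.socketpair()
payload = b"x" * (4 * 1024 * 1024)
received = [0]
reader = threading.Thread(target=drain, args=(online, len(payload), received))
reader.start()
stalls = send_with_backpressure(gateway, payload)
reader.join()
gateway.close()
online.close()
print(f"delivered {received[0]} bytes, sender waited {stalls} time(s)")
```

If the real gateway code treats the first "buffer full" error as fatal for the connection, that alone could explain why traffic does not resume without a reset; that is a possibility to check in the code, not a diagnosis.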

Thanks
Wong seng guan

Robert Gezelter · ‎08-12-2009

wong seng guan,

TCP buffering will only do so much. Also, as Hoff has observed, it is not possible to exclude the possibility of two (sets?) of bugs: one in the application(s) on the gateway system and one in the TCP implementations, or the interaction of the different implementations (such interaction problems are exceedingly rare, but they do happen).

First, TCP buffering should generally not be used as a queue management implementation. If one is doing transaction processing of some sort, one should either implement a buffering solution (with flow controls) in the server (and its access routines that are used in the clients), or use a request management package (e.g., RTR).
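As an illustration of application-level buffering with flow control, here is a minimal Python sketch of a request queue with high and low water marks. The class name, method names and thresholds are hypothetical, and this is not how RTR or any particular package works. The producer is told to hold off once the queue reaches the high mark and may resume only after it drains to the low mark.

```python
from collections import deque

class FlowControlledQueue:
    """Bounded request queue with high/low water marks (hypothetical)."""

    def __init__(self, high_water, low_water):
        if not 0 <= low_water < high_water:
            raise ValueError("need 0 <= low_water < high_water")
        self._items = deque()
        self.high_water = high_water
        self.low_water = low_water
        self.paused = False

    def offer(self, item):
        """Return True if accepted, False if the producer must hold off."""
        if self.paused:
            return False
        self._items.append(item)
        if len(self._items) >= self.high_water:
            self.paused = True  # tell producers to stop sending
        return True

    def take(self):
        """Remove the oldest item and lift the pause once drained enough."""
        item = self._items.popleft()
        if self.paused and len(self._items) <= self.low_water:
            self.paused = False
        return item

    def __len__(self):
        return len(self._items)

queue = FlowControlledQueue(high_water=4, low_water=1)
accepted = [queue.offer(n) for n in range(6)]  # last two are refused
drained = [queue.take() for _ in range(3)]     # drains down to the low mark
resumed = queue.offer(99)                      # accepted again after draining
print(accepted, drained, resumed)
```

The gap between the two marks keeps the producer from flapping between paused and resumed on every single request.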

Identifying problems with the TCP stacks themselves would require either or both of two approaches: a trace of the communications flow to/from the server with a tool such as Wireshark (http://www.wireshark.org) and some simple, stripped down test cases that produce the aberrant behavior.

The test cases can then be provided together with the resulting traces to the appropriate developers to reproduce the problem.

As a start, I would want to study the central server, to understand just what is going on that is causing the bottleneck. Approaches that may work at moderate or intermediate loads may not be suitable at high rates of activity. A performance study of the central processor and a code review are probably a good start.

- Bob Gezelter, http://www.rlgsc.com
