Informix database replication stalls between servers

Hello all,

We have a customer using Informix Dynamic Server 7.31 on an L-Class server. They are trying to use a facility in the database that allows them to replicate the database to another L-Class of the same specification.

The replication uses TCP/IP sockets on ports defined in the /etc/services file: one database server acts as a listener, and the production server transmits data to it in the form of transaction logs.

At the moment it is not working, as the replication appears to "stall" between the two servers.

The servers have been patched to March 2002 Hardware and Quality Pack with the latest ARPA patches as well.

They have also tried several different port numbers, but to no avail.

At the moment we cannot prove whether it's the OS, the database or the network that is the problem.

Our current diagnosis of the problem is summarised below:

The primary server is the one sending logs to the secondary.

On the primary, tracing in the database shows that the server has filled a buffer with a section of the logical logs and is attempting to pass it via the operating system to the open port. The OS returns error 246, which is EWOULDBLOCK.
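For anyone wanting to reproduce the symptom outside the database, here is a minimal Python sketch of what a full send queue looks like from the application side. It uses a local socket pair as a stand-in for the primary/secondary link (an assumption, not the Informix setup), and the name fill_send_queue is made up for illustration. The errno number is platform-specific; 246 is simply what is reported on this HP-UX system.

```python
import errno
import socket


def fill_send_queue(chunk_size=65536):
    """Write to a non-blocking socket whose peer never reads.

    Returns (bytes_queued, errno_seen) once the kernel refuses more data.
    """
    # Local socket pair: a stand-in for the replication link (assumption).
    sender, receiver = socket.socketpair()
    sender.setblocking(False)
    payload = b"x" * chunk_size
    queued = 0
    try:
        while True:
            try:
                queued += sender.send(payload)
            except BlockingIOError as exc:
                # The send queue is full and the peer is not draining it.
                return queued, exc.errno
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    queued, err = fill_send_queue()
    print(f"queued {queued} bytes, then errno {err} ({errno.errorcode.get(err)})")
```

The point is only that EWOULDBLOCK on its own is a normal answer from a non-blocking socket. It becomes a problem when the queue never drains afterwards.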

Using the command "netstat -an | grep 300" we can pick out the relevant port. The Send-Q column is at its maximum value (i.e. the send queue is full), and for some reason it never drains.
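To watch the queue over time rather than eyeballing repeated netstat runs, something along these lines could be used. It is a rough sketch that assumes the common "Proto Recv-Q Send-Q Local Foreign State" column layout, with the port appended to the address after a dot or colon. The port 3001 and the addresses in SAMPLE are invented for illustration, as are the names TcpEntry and parse_netstat.

```python
import re
from typing import List, NamedTuple


class TcpEntry(NamedTuple):
    recv_q: int
    send_q: int
    local: str
    foreign: str
    state: str


def parse_netstat(text: str, port: str) -> List[TcpEntry]:
    """Return the TCP entries in `netstat -an` output that involve `port`."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # header, UDP or unrelated line
        if not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        local, foreign = fields[3], fields[4]
        # The port is the last component after "." or ":" in the address.
        ports = {re.split(r"[.:]", addr)[-1] for addr in (local, foreign)}
        if port in ports:
            entries.append(
                TcpEntry(int(fields[1]), int(fields[2]), local, foreign, fields[5])
            )
    return entries


# Hypothetical sample output; not taken from the customer's machines.
SAMPLE = """\
tcp        0  32768  10.0.0.1.3001          10.0.0.2.49200         ESTABLISHED
tcp        0      0  10.0.0.1.23            10.0.0.9.51000         ESTABLISHED
tcp        0      0  *.3001                 *.*                    LISTEN
"""

if __name__ == "__main__":
    for entry in parse_netstat(SAMPLE, "3001"):
        if entry.state == "ESTABLISHED" and entry.send_q > 0:
            print(f"{entry.local} -> {entry.foreign}: Send-Q {entry.send_q}")
```

Running this against real netstat output every few seconds on both machines would show whether Send-Q on the primary ever decreases and whether Recv-Q on the secondary is growing at the same time.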

Eventually the primary server assumes (probably correctly in this case) that the connection is broken. It closes down the port and waits for the secondary to attempt to reconnect.

Switching to the secondary, once again using "netstat -an | grep 300", we can see that the port remains open there. The secondary is completely unaware that the primary has closed this port. The port is only freed once the OS's keep_idle limit is reached.
This notifies the secondary, which then attempts to contact the primary server, restarting the cycle.

To summarise, the database provider believes the engine on both the primary and secondary servers is functioning correctly. There are two points at which the operating system or network appears to be failing.

Firstly, the send queue fails to drain. Secondly, when the port is closed on the primary, the instruction to close the port on the secondary is either ignored or not received.

Any assistance would be much appreciated.

David Rew, Waverley Technical Services.
2 REPLIES
Rainer von Bongartz
Honored Contributor

Re: Informix database replication stalls between servers


This is a well-known INFORMIX defect, number 49592.


Detailed Information For Defect 49592
DR PRIMARY THREAD HANG WHEN HP SYSTEM CALL SELECT MALFUNCTIONS.

Long Description:

The problem occurred when the DR primary thread wanted to send but received an "EWOULDBLOCK" network error. After several retries, the DR primary thread would go into a yield state to wait for the network to recover from its busy state. However, if the network took too long to recover, the DR primary thread would sleep forever and ignore the wakeup signal sent from the DR ping thread.
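To illustrate the general pattern the defect text describes (retry on EWOULDBLOCK, then wait for the socket to become writable), here is a small Python sketch of a sender that bounds its wait with a select timeout instead of sleeping indefinitely. This is not how the Informix engine is implemented, which I have no visibility into. The function name send_with_deadline and the deadline parameter are assumptions for illustration.

```python
import select
import socket
import time


def send_with_deadline(sock, data, deadline_s):
    """Send data on a non-blocking socket, giving up after deadline_s seconds.

    Returns the number of bytes actually handed to the kernel.
    """
    end = time.monotonic() + deadline_s
    view = memoryview(data)
    sent = 0
    while sent < len(data):
        try:
            sent += sock.send(view[sent:])
            continue
        except BlockingIOError:
            pass  # EWOULDBLOCK/EAGAIN: the send queue is full
        remaining = end - time.monotonic()
        if remaining <= 0:
            break  # the peer did not drain the queue in time
        # Bounded wait for writability; never sleeps past the deadline.
        select.select([], [sock], [], remaining)
    return sent


if __name__ == "__main__":
    a, b = socket.socketpair()  # local stand-in for the link (assumption)
    a.setblocking(False)
    total = 8 * 1024 * 1024
    done = send_with_deadline(a, b"x" * total, 0.5)
    print(f"sent {done} of {total} bytes before giving up")
    a.close()
    b.close()
```

A caller that gets back fewer bytes than it asked for knows the peer has stopped reading and can close and reconnect, rather than relying on a separate wakeup that may never be acted on.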



Order a release which fixes defect number 49592


Regards
Rainer
He's a real UNIX Man, sitting in his UNIX LAN making all his UNIX plans for nobody ...

Re: Informix database replication stalls between servers

Thanks for this one but according to IBM this is an old 1995/96 problem which is fixed in the version the end user is running.

Any other ideas would be appreciated. If you think IBM are stonewalling let me know as we know almost nothing about Informix.

Thanks

David