Operating System - OpenVMS

VMS732_LAN-V0700 and AVAIL_MAN_ANA_SRVR V3.0-2

 
Duncan Morris
Honored Contributor

I recently installed the VMS732_LAN-V0700 patch on a test server, which also functions as an Availability Manager server node.

Since installing the patch I have found that Availability Manager appears to lose contact with remote collector nodes after about 2.5 hours of running. It continues to provide data about the local node to the analyzer, but the other collector nodes on the LAN appear as "Path lost" in the analyzer.

If I issue

@sys$startup:amd$startup restart

(without restarting the AM server process), then the missing nodes reappear quite happily.

I could find no events in the availability manager logs indicating any problem.

The only other information I have gleaned is that the LAN device starts clocking up "Unavailable user buffer" errors until I do the restart:

DEVT02> mc lancp show dev/cou

DEVT02 Device Counters EWA0 (20-NOV-2009 15:07:37.14):
Value Counter
----- -------
11028 Seconds since last zeroed

309 Unavailable user buffers (20-NOV-2009 15:07:26.24)

The LAN device is a DE500-BA, on a DS20e.

When I back out the patch installation, then availability manager runs without any issues.

Has anybody else noticed any "funnies" with this patch?

Many thanks,

Duncan
16 REPLIES
Volker Halle
Honored Contributor

Re: VMS732_LAN-V0700 and AVAIL_MAN_ANA_SRVR V3.0-2

Duncan,

while you're running the Availability Manager protocol, you can check the protocol specific counters with ANAL/SYS and SDA> SHOW LAN/FULL/DEV=EWA0

Look for the counters for the protocol (AMDS) and check 'Unavailable User buffer' and 'Last UUB time'. This should at least tell you whether the Unavailable user buffer events are happening for this specific protocol.

Volker.
Duncan Morris
Honored Contributor

Re: VMS732_LAN-V0700 and AVAIL_MAN_ANA_SRVR V3.0-2

Thanks Volker.

At the moment all the protocols are showing "None" for Last UUB Time.

The generic device EWA has a last UUB time corresponding to the time I last restarted AMDS.

I will have to wait until the next Path Lost event to check out the specific EWA22 80-48 (AMDS) Counters.

Duncan
Duncan Morris
Honored Contributor

Re: VMS732_LAN-V0700 and AVAIL_MAN_ANA_SRVR V3.0-2

On cue, the AM display has dropped the remote nodes!

I have attached the SDA output for the specific EWA22 device.

If you think that there is any value in the whole SHOW LAN/FULL output, then I can add that in.

Duncan
Volker Halle
Honored Contributor

Re: VMS732_LAN-V0700 and AVAIL_MAN_ANA_SRVR V3.0-2

Duncan,

Last receive 20-NOV 17:35:44, this is a long time ago (about 8 minutes)! And PDUs received exactly 5000 ?!

Did you check all other protocols for a non-zero 'Unavail user buffer' count? Or is AMDS the only protocol affected?

Volker.
Duncan Morris
Honored Contributor

Re: VMS732_LAN-V0700 and AVAIL_MAN_ANA_SRVR V3.0-2

Volker,

only EWA22 (AMDS) shows any UUB count.

Full SDA report zipped in attachment

Duncan
Volker Halle
Honored Contributor

Re: VMS732_LAN-V0700 and AVAIL_MAN_ANA_SRVR V3.0-2

Duncan,

please consider reporting this problem to HP.

Every node running the AMDS protocol should send AMDS multicast messages every couple of seconds, so NOT receiving any AMDS messages for more than 8 minutes is very unusual.

You may also try to look at the default LAN trace data in SDA:

$ SET TERM/WID=132
$ ANAL/SYS
SDA> LAN TRACE
...
SDA> EXIT

Volker.
Duncan Morris
Honored Contributor

Re: VMS732_LAN-V0700 and AVAIL_MAN_ANA_SRVR V3.0-2

Volker,

I will look at LAN trace when I am back at work.

In the meantime, I took another snapshot of the counters - and guess what? The device stopped receiving after 5000 PDUs again (see attachment).

This is surely not a coincidence!

I have just restarted AMDS and will check the counters again tomorrow.

Many thanks for your advice.

Duncan
Duncan Morris
Honored Contributor

Re: VMS732_LAN-V0700 and AVAIL_MAN_ANA_SRVR V3.0-2

Following last night's restart of AMDS, the problem returned once again after exactly 5000 PDUs received.

I will try to get this raised to HP (we have a 3rd party support contract, so no direct HP route).

Duncan
Graham Burley
Frequent Advisor

Re: VMS732_LAN-V0700 and AVAIL_MAN_ANA_SRVR V3.0-2

From the VMS732_LAN-V0700 Cover Letter
http://decuserve.org/anon/htnotes/note?f1=ECO&f2=2621.0

5.2 Problems addressed in this kit
5.2.6 Excessive Pool Consumption
5.2.6.1 Problem Description:

[snip] This change enforces a maximum limit of 5000 buffers outstanding for any application running on a LAN device, [snip]

[snip] If pool reclamation is in use, and if the limit of 5000 outstanding buffers is unreasonable, you can override this restriction by setting the system parameter LAN_FLAGS bit 16 (0x00010000). [snip]
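
As a quick check of the arithmetic in that cover-letter extract, here is a minimal Python sketch showing that bit 16 corresponds to the mask 0x00010000 (decimal 65536), and how that bit could be OR-ed into an existing value without disturbing other bits. The helper names `with_bit_set` and `is_bit_set` are made up for illustration, and the sample current value of 0x2 is an assumption, not a value taken from any system in this thread.

```python
# Illustrative bit arithmetic only; this does not touch any system parameter.
LAN_FLAGS_OVERRIDE_BIT = 16  # bit number quoted in the cover letter


def with_bit_set(value: int, bit: int) -> int:
    """Return value with the given bit set, leaving all other bits unchanged."""
    return value | (1 << bit)


def is_bit_set(value: int, bit: int) -> bool:
    """Return True if the given bit is set in value."""
    return bool(value & (1 << bit))


# The mask for bit 16 on its own.
mask = with_bit_set(0, LAN_FLAGS_OVERRIDE_BIT)
print(hex(mask), mask)  # 0x10000 65536

# Hypothetical existing value with bit 1 already set (assumed for illustration).
current = 0x2
updated = with_bit_set(current, LAN_FLAGS_OVERRIDE_BIT)
print(hex(updated), is_bit_set(updated, LAN_FLAGS_OVERRIDE_BIT))  # 0x10002 True
```

The point of OR-ing rather than assigning the mask directly is that any bits already set in the parameter would be preserved.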
Duncan Morris
Honored Contributor

Re: VMS732_LAN-V0700 and AVAIL_MAN_ANA_SRVR V3.0-2

Graham,

interesting! There shouldn't be any buffers outstanding - but maybe that is the problem.

I have not observed pool consumption issues on other nodes which have had the availability manager server running for long periods, so perhaps there is a problem with the way in which the LAN patch has been implemented.

I will play with LAN_FLAGS on Monday and report back.

Duncan
Volker Halle
Honored Contributor

Re: VMS732_LAN-V0700 and AVAIL_MAN_ANA_SRVR V3.0-2

Duncan,

the extract from the release notes found by Graham certainly explains the '5000' number seen.

AMDS will most likely NEVER have 5000 'outstanding' network I/Os at the same time, but there may be a problem in some return path from the LAN driver that does not correctly decrement the number of outstanding buffers and therefore causes this problem.
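
To make that hypothesis concrete, here is a toy Python model of a per-application outstanding-buffer counter with a limit of 5000. It is only a sketch of the suspected behaviour: the class `ToyLanClient`, its `leak` switch and the `simulate` helper are invented for illustration and say nothing about how the real LAN driver is written.

```python
# Toy model of the hypothesis: a counter that is incremented on receive
# but never decremented on completion will eventually hit the limit.
OUTSTANDING_LIMIT = 5000  # limit quoted in the patch cover letter


class ToyLanClient:
    def __init__(self, limit: int = OUTSTANDING_LIMIT, leak: bool = True):
        self.limit = limit
        self.leak = leak                    # True: completion forgets to decrement
        self.outstanding = 0                # buffers the model thinks are in use
        self.received = 0                   # PDUs delivered to the application
        self.unavailable_user_buffers = 0   # PDUs dropped at the limit

    def receive(self) -> bool:
        """Deliver one PDU; return False if it is dropped at the limit."""
        if self.outstanding >= self.limit:
            self.unavailable_user_buffers += 1
            return False
        self.outstanding += 1
        self.received += 1
        # The application hands the buffer back straight away; in the
        # leaky case the counter is not adjusted on that return path.
        if not self.leak:
            self.outstanding -= 1
        return True


def simulate(pdus: int, leak: bool) -> ToyLanClient:
    client = ToyLanClient(leak=leak)
    for _ in range(pdus):
        client.receive()
    return client


leaky = simulate(6000, leak=True)
print(leaky.received, leaky.unavailable_user_buffers)  # 5000 1000

healthy = simulate(6000, leak=False)
print(healthy.received, healthy.unavailable_user_buffers)  # 6000 0
```

In the leaky case the model stops receiving at exactly 5000 PDUs and counts every later PDU as an unavailable user buffer, which is the same pattern as the counters reported in this thread.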

This problem can only be diagnosed and solved by the HP engineering team that designed this new algorithm and patch. Maybe you can work around the immediate issue with the LAN_FLAGS system parameter, but the underlying problem must be fixed.

Volker.
Duncan Morris
Honored Contributor

Re: VMS732_LAN-V0700 and AVAIL_MAN_ANA_SRVR V3.0-2

Volker - I agree with your analysis.

Graham - It looks as though setting bit 16 in LAN_FLAGS has provided a temporary workaround, as there are currently more than 5000 received PDUs showing on the latest device.

There is clearly an unexpected interaction between the new algorithm and Availability Manager. Only HP will be able to determine which one is at fault!

I will be escalating this tomorrow (Monday).

Duncan
klingeren
Occasional Visitor

Re: VMS732_LAN-V0700 and AVAIL_MAN_ANA_SRVR V3.0-2

Hi there, same problem over here, running LAN V0700 on the analyzer node and the data provider nodes. AMDS traffic stops, and AMDS must be stopped and started again to be able to monitor for a while (sometimes only a few minutes).
Can anyone tell me the current status?
regards, Rene
Hoff
Honored Contributor

Re: VMS732_LAN-V0700 and AVAIL_MAN_ANA_SRVR V3.0-2

Given the timing, it appears unlikely we'll see an update for this misbehavior prior to "lockdown", so set the bit flag cited earlier as a potential workaround, and (if there's formal support in place) report this to HP.
Duncan Morris
Honored Contributor

Re: VMS732_LAN-V0700 and AVAIL_MAN_ANA_SRVR V3.0-2

Rene,

these problems are not described in the release notes for the newer V3.1 Availability Manager - but are documented in the VMS V8.4 release notes!!!

See section 4.5 from this document:

http://h71000.www7.hp.com/doc/84final/6677/6677pro_sm.html

Correspondence with HP support resulted in them saying that there was nothing wrong with the LAN patch, nor with Availability Manager - their only solution was to set the LAN_FLAGS bit.

Duncan
Duncan Morris
Honored Contributor

Re: VMS732_LAN-V0700 and AVAIL_MAN_ANA_SRVR V3.0-2

The only resolution for this issue appears to be the setting of the LAN_FLAGS bit as indicated.