Re: VMS 7.3-2 Copy slow on Alpha

Volker Halle · ‎02-16-2009

John,

I'm unable to download your attachment. Feel free to mail it to me (see my ITRC profile).

Regarding the quotas: note that you've successfully done a local copy on NodeB to NodeB::filename, bypassing the physical network. That copy should also have been slow if there were problems with disk access, quotas, etc. Please make sure that you are using the same username for the 'remote' FAL process on NodeB in both test scenarios.

Volker.

Hein van den Heuvel · ‎02-16-2009

FTP has logical name options which, when set, may help its performance a lot. Notably TCPIP$FTP_FILE_ALQ and TCPIP$FTP_FILE_DEQ, but others exist:

http://h71000.www7.hp.com/doc/83final/6526/6526pro_041.html

I expect DECnet to just listen to RMS defaults.

MBC=32 is the new default, but it is still on the low side. At least double that, or max out to 127. Some suggest the highest possible multiple of 16 (112) might work slightly better with XFC and storage interactions.

You also want some more buffers. Like 4 or 8.
Going from 1 to 2 is potentially the big winner. Rapidly diminishing returns beyond that. I don't trust 127 buffers. That seems too high to help and may hinder. Try it though, as your mileage may, and will, vary.
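As a minimal sketch of the settings discussed above, the RMS defaults can be inspected and changed with SET RMS_DEFAULT. The values are illustrative starting points taken from the suggestions in this post, not measured recommendations, and the /NETWORK_BLOCK_COUNT line is an assumption about what matters for DECnet copies; test with and without it.

```dcl
$! Show the current process and system RMS defaults
$ SHOW RMS_DEFAULT
$! Larger multiblock count and a few buffers for sequential file access
$ SET RMS_DEFAULT /BLOCK_COUNT=112 /BUFFER_COUNT=4
$! Block count used for network (DECnet) file access
$ SET RMS_DEFAULT /NETWORK_BLOCK_COUNT=112
$! Confirm the new process defaults
$ SHOW RMS_DEFAULT
```

These commands change the current process only. The FAL process on the remote node uses the defaults in effect there, so a system-wide change on that node (the /SYSTEM qualifier, which needs privilege) may be required to see an effect on the receiving side.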

Did you check for fragmentation on the output files as I suggested in an earlier reply?
$ pipe dump /head/bloc=count=0 | sea sys$pipe "Extension file"

Regards,
Hein van den Heuvel ( at gmail dot com )
HvdH Performance Consulting.

Oswald Knoppers_1 · ‎02-16-2009

"
NodeA: Lancp shows no errors on EWA0 (DE500-BA), 4x unrecognised multicast packets. And EWB0 (Not configured for Decnet!) shows last error at 18:07 Today and 3105 carrier check failures?
NodeB: Lancp shows no errors on EWA0 (DE500-BA), And EWB0 (Not configured for Decnet!) shows last error at 18:03 Today and 300 carrier check failures?
"

Are you sure the EWB0 devices are not configured for DECnet? Net$configure (in basic mode) will configure all lan devices it can find for DECnet. Check this with:

$ mc ncl sho routing circuit *

Do you have any routers involved?

$ mc ncl sho routing circuit xxx adja * all

Oswald

John_TT · ‎02-16-2009

Hello, I have increased the NSP MAX WINDOW parameter from 20 to 120 and seem to have consistent "fast" copy transfers on my test systems... Here are the MC NCL SHOW NSP ALL characteristics, any comments? Should "Congestion Avoidance" be FALSE ?

I'm off to try another system or 2...

$ MC NCL SHOW NSP ALL
Characteristics

Maximum Transport Connections = 200
Maximum Receive Buffers = 4000
Delay Weight = 3
Delay Factor = 2
Maximum Window = 20
DNA Version = T4.2.1
Acknowledgement Delay Time = 3
Maximum Remote NSAPS = 201
NSAP Selector = 32
Keepalive Time = 60
Retransmit Threshold = 8
Congestion Avoidance = False
Flow Control Policy = Segment Flow Control

Willem Grooters · ‎02-17-2009

Since everyone seems to agree it's DECnet causing the trouble, I would ask a few other questions on the behaviour:

* Does it happen with the same pair(s) of nodes, or is it random?
* Does it happen on the same files, for each pair, or is it random?
* How fragmented is the source disk?
* Is the file severely fragmented?
* How fragmented is the destination disk?
* What's the extent size on the receiving system?
* Have you thought about turning off highwater marking during COPY?
* If these files are indexed, how good (or bad) is the internal structure, and how are the buckets located on disk?
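For reference, a rough sketch of how several of the questions above could be checked from DCL. The device DKA100: and the file [DATA]BIGFILE.DAT are hypothetical names, and the extend quantity is an arbitrary example value.

```dcl
$! DKA100: and [DATA]BIGFILE.DAT are hypothetical; substitute your own
$! Volume characteristics, including highwater marking and extend quantity
$ SHOW DEVICE /FULL DKA100:
$! Display only the file header; the map area lists the file's extents
$ DUMP /HEADER /BLOCK=COUNT=0 DKA100:[DATA]BIGFILE.DAT
$! Turn off highwater marking on the destination volume
$ SET VOLUME /NOHIGHWATER_MARKING DKA100:
$! Larger default extend quantity for files created by this process
$ SET RMS_DEFAULT /EXTEND_QUANTITY=20000
```

Highwater marking can be turned back on afterwards with SET VOLUME /HIGHWATER_MARKING once the test is done.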

Willem

Willem Grooters
OpenVMS Developer & System Manager

John_TT · ‎02-17-2009

Hi Willem,

* Does it happen with the same pair(s) of nodes, or is it random?
>>Random, any 2 nodes from 40

* Does it happen on the same files, for each pair, or is it random?
>> Random files

* How fragmented is the source disk?
>> The disks have regular defrag

Setting the NSP MAX WINDOW parameter to 60 also resolves the problem. For example, a daily transfer of 30 large files between 2 systems usually took minutes; yesterday it took 2 hours 56 minutes, and the same transfer today took less than 9 minutes after changing MAX WINDOW. The behaviour appears to have changed from one minute to the next some days ago. Setting MAX WINDOW back to 20 brings back the slow copy. Some of the systems have not been rebooted for almost 2 years and were OK until recently.

Volker Halle · ‎02-17-2009

John,

setting NSP MAXIMUM WINDOW size higher allows more data to be sent to the remote end without waiting for acknowledgements. This may be 'masking' (or solving) changes in the underlying network performance.
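A minimal sketch of how the window could be inspected and changed from DCL follows. The attribute syntax is assumed to match the `MC NCL SHOW NSP ALL` output shown earlier in the thread, and 60 is simply the value reported to work there. A change made this way is assumed to last only until DECnet is restarted.

```dcl
$! Display the current NSP transmit window
$ MCR NCL SHOW NSP MAXIMUM WINDOW
$! Raise it; 60 is the value reported in this thread, not a tuned optimum
$ MCR NCL SET NSP MAXIMUM WINDOW = 60
$! Confirm the new value
$ MCR NCL SHOW NSP MAXIMUM WINDOW
```

Since a transfer involves two nodes, it is worth checking the value on both ends before comparing copy times.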

Note that you observed low performance especially when 'pushing' data to the remote node.

If the systems have not been rebooted in 2 years and you see this sudden drop in performance, it's hard to argue that this is caused by an OpenVMS DECnet problem. Something else may have changed, which now causes those transfers to 'suffer' ...

But if you can override this by extending the transmit window, so be it ;-)

Volker.

Jon Pinkley · ‎02-17-2009

John,

I realize you stated "No pool problems." in response to Volker's question: "Any problems with nonpaged pool (SHOW MEM/POOL/FULL) ?"

Has your non-paged pool been extended, in other words is it larger than its initial size? You may be experiencing pool fragmentation.

You may want to consider disabling memory reclamation. Search for NPAG_GENTLE and NPAG_AGGRESSIVE. Setting both to 100 will disable it, but will require a reboot to fix. My guess is that increasing the window size causes multiple buffers to be requested at the same time, which then allows the stalls (allocating new request packets) to be consolidated, letting the copy stream for a while.

See following threads.

Bad performance Openvms http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=922770

Nonpaged pool problem http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1169934

High Interrupt CPU when shadow copy http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1099166

Non-paged dynamic memory http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=734128

Jon

it depends

John_TT · ‎02-19-2009

Hi, sorry for the delay. Regarding Pool, the current and initial sizes are the same.

I have now tried setting max window back to 20 and setting NPAG_AGGRESSIVE and NPAG_GENTLE to 100 on my test systems. The file transfers are now working fast.

Volker Halle · ‎02-19-2009

John,

before jumping to conclusions, please verify this result by setting both parameters back to their previous values and re-test the transfers.

Volker.

John_TT · ‎02-19-2009

I reset the 2 NPAG_ parameters and the slow copy is back. Changing either NSP MAX WINDOW or the 2 NPAG_ parameters clears the problem.

Hein van den Heuvel · ‎02-19-2009

Nice.

So supposedly the slowdown was visible in (kernel mode) CPU consumption. Would you happen to be able to confirm that, perhaps with T4 ?

And, if you can still spend time on this, do you suppose we can create a 'signature' from this with ANALYZE/SYSTEM tools like PCS or SPL to show particular code areas, or spinlocks, being hit disproportionately when this is an issue?

Thanks!
Hein.

John_TT · ‎02-20-2009

Hi, thanks to everyone for the good troubleshooting info', haven't had so much fun in years! :-)

Hein, I would like to look further into this, but time is a constraint. I have made changes to 4 of our systems and will monitor for a while. What exactly would you like me to do?

I will be out from today until next Tuesday (24/2)...

Hein van den Heuvel · ‎02-20-2009

Well, it might be enough to do a MONI MODE during one of those 20 minute transfers.
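As a sketch of what that could look like, run from a second session while a slow copy is in progress. The 5-second interval and the recording file name are arbitrary choices for illustration.

```dcl
$! Live display of time spent in each processor mode, averaged
$ MONITOR MODES /INTERVAL=5 /AVERAGE
$! Alternative: record without display, then produce a summary afterwards
$ MONITOR MODES /INTERVAL=5 /NODISPLAY /RECORD=SYS$SCRATCH:COPY_MODES.DAT
$ MONITOR MODES /INPUT=SYS$SCRATCH:COPY_MODES.DAT /NODISPLAY -
        /SUMMARY=SYS$SCRATCH:COPY_MODES.SUM
```

A high share of interrupt or kernel mode time during the slow transfer, compared with a fast one, would support the pool theory.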

A full T4 collection, starting a few minutes before and ending a few minutes after a transfer on an (idle-ish?) system, would be nicer.

Irrespective of this exercise, you may want to try this some day:
$ ANALYZE/SYSTEM
SDA> SPL LOAD
SDA> SPL START TRACE
:

:
SDA> SPL STOP TRACE
SDA> SPL SHOW TRACE /SUM
SDA> SPL ANALYZE ...

Similar commands for PCS and/or PRF.

Hein.

Volker Halle · ‎02-22-2009

John,

a T4 collection with a sample time of 1 second may be most useful.

SDA> SPL analysis only produces meaningful data on SMP systems.

SDA> PRF only works on V8.3 and EV6 systems.

SDA> PCS (PC-sampling) would be useful to collect PC values to determine where the system spends its time. If the pool reclamation theory is right, lots of PC samples should be in SYSTEM_PRIMITIVES.EXE - probably at IPL 11. (IPL$_POOL)

Gentle pool reclamation occurs every NPAG_INTERVAL (default=30) seconds and only handles 2 lookaside lists on each call, so it would be hard to argue that a 'fast DECnet file copy' could actually run into this.

Aggressive pool reclamation is only called if a request to allocate a packet from the nonpaged variable list failed. If NPAG_AGGRESSIVE is set to 100, packets will stay on the lookaside lists and not get reclaimed; this may have a positive effect on performance. What does SHOW MEM/POOL/FULL show for Nonpaged Dynamic Memory?

The NPAG_GENTLE and NPAG_AGGRESSIVE parameters are dynamic, so you can change them with SYSGEN> WRITE ACTIVE.
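A minimal sketch of that SYSGEN sequence, using the value 100 discussed in this thread:

```dcl
$! Change the two dynamic parameters on the running system (privileged)
$ MCR SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SHOW NPAG_GENTLE
SYSGEN> SHOW NPAG_AGGRESSIVE
SYSGEN> SET NPAG_GENTLE 100
SYSGEN> SET NPAG_AGGRESSIVE 100
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT
```

Noting the values displayed by the two SHOW commands first makes it easy to set them back for a comparison test. WRITE ACTIVE affects only the running system; to keep the values across reboots, the usual route is to add them to SYS$SYSTEM:MODPARAMS.DAT and run AUTOGEN, which is not shown here.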

Volker.

Jon Pinkley · ‎02-23-2009

John,

Several things to consider:

More than a single system is involved, so a problem on either end can affect performance. You need to look at both systems' memory.

As Volker said, the NPAG_GENTLE and NPAG_AGGRESSIVE parameters are dynamic and therefore they can be changed without a reboot. However, most of the pool fragmentation due to frequent allocation and deallocation from the variable list will remain until the system is rebooted.

Just rebooting itself will "defragment" the pool, so the problem may no longer exhibit itself for a while (or until you run something that thrashes the pool). The fact that the systems haven't been booted for a long period of time suggests that the pool has had an opportunity to be fragmented, especially if pool reclamation is enabled.

Setting the NPAG_GENTLE and NPAG_AGGRESSIVE parameters to 100 will perhaps require more non-paged pool, but it should reduce the fragmentation of the free space in the pool.

What version of VMS are you running? Later versions have more lookaside lists. I don't know what size packets DECnet V is requesting; if they are larger than the size of the largest lookaside list blocksize, then they must be allocated from the variable list, and that gets expensive when the free pool gets fragmented.

Guessing here. Perhaps by specifying a larger windowsize, more buffers are allocated and then reused without freeing/allocating. So perhaps an optimal solution is a combination of turning off reclamation, and specifying a larger windowsize.

Jon

it depends
