Operating System - OpenVMS

Re: VMS 7.3-2 Copy slow on Alpha

 
John_TT
Advisor

VMS 7.3-2 Copy slow on Alpha

Hi, we seem to have developed a problem on our Alpha systems (38 DS10s and DS15s). A number of the systems are set up in master/slave pairs, with the master updating the slave by frequently copying files over the network. The application was written more than 10 years ago and has been running OK since, at first on VAXen. Recently, one of the application developers noticed that some file copies were taking much longer than usual, i.e. going from 3 minutes to 12 minutes. We can reproduce the problem on any two of the systems. Sometimes the copy completes in a few seconds, as expected, but most of the time it is slow.

$ copy fileA.txt (50000 blocks) nodeB::*.*;
- Should take about 6 seconds (as seen in the past) but takes around 23-24 minutes. The copy does eventually complete, and no error messages are seen. I have tried nodeB"system password":: with the same result.

$ Copy nodeB::fileA.txt (same file 50000 blocks) *.*;
-This takes around 6 seconds...

I have looked for duplex problems, switch problems and NIC errors but can't find the cause. I can reproduce the problem on 2 standalone systems with 1 switch, or with a crossover network cable. The NICs and the Cisco Catalyst switches are set to auto-negotiate. I have also tried FastFD settings on all NICs (with 100 Mb full duplex on the switches), with no change.

The Alphas are DS10/DS15's with VMS v7.3-2 V0900 patch update, although I have tried with Update V1600 on 2 systems. There have been no significant changes to the systems in about 2 years.

Any suggestions on where to look next, or how to trace the problem?
40 REPLIES
Volker Halle
Honored Contributor

Re: VMS 7.3-2 Copy slow on Alpha

John,

welcome to the OpenVMS ITRC forum.

Check with $ MC LANCP SHOW DEV/INT for LAN error messages (bottom of page). Also check the LAN counters (LANCP> SHOW DEV/COUNT) on both systems.

Which version of DECnet are you using (DECnet-IV or DECnet-OSI) ? Or DECnet-over-IP ?

If there is ANY packet loss when using DECnet NSP or OSI transport on a LAN, the retransmit timeout is normally what causes the perceived 'bad performance'.

You can 'see' the transmit timeouts by watching the NODE (or link) counters between the 2 systems involved:

DECnet-IV:

$ MC NCP SHOW NODE nodeB COUNTER (from NodeA and vice-versa)

Look for 'RESPONSE TIMEOUTS' increasing while your 'slow' copy is active.

DECnet-OSI:

$ MC NCL SHOW {NSP | OSI TRANSPORT } LOCAL NSAP xxx REMOTE NSAP yyy ALL COUNTER

and look for 'Duplicate PDUs Received' and 'Retransmitted PDUs' increasing.

The timeout for a lost packet is typically quite high (many seconds) and will cause the traffic on the link between the 2 processes to momentarily stop.
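To put a rough number on this, here is a minimal sketch, assuming each lost packet stalls the link for one full retransmit timeout. The 5-second timeout and the function names are illustrative assumptions, not values read from DECnet; the 6-second baseline and the roughly 24-minute slow copy come from the original post.

```python
# Rough model: every lost packet stalls the logical link for one retransmit timeout.
# ASSUMPTION: timeout_s = 5.0 is an illustrative value, not a DECnet default.

def copy_time_with_losses(baseline_s: float, lost_packets: int, timeout_s: float = 5.0) -> float:
    """Elapsed time if each lost packet adds one full timeout to the baseline."""
    return baseline_s + lost_packets * timeout_s


def losses_needed(baseline_s: float, observed_s: float, timeout_s: float = 5.0) -> float:
    """Number of timeouts that would be needed to explain the observed elapsed time."""
    return (observed_s - baseline_s) / timeout_s


if __name__ == "__main__":
    baseline = 6.0        # seconds, the "normal" copy time from the original post
    observed = 24 * 60.0  # seconds, the slow copy (about 24 minutes)
    print(copy_time_with_losses(baseline, 10))  # 56.0
    print(losses_needed(baseline, observed))    # 286.8
```

Under this simple model a handful of timeouts would not turn 6 seconds into 24 minutes; it would take hundreds, and that many retransmissions should be clearly visible in the counters.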

Try running MC DTSEND data/NODE=NodeB tests as well; this will rule out all possible disk-I/O related delays. You need to set up DTR (object or session control application) on the remote node prior to running this test.

Volker.
John_TT
Advisor

Re: VMS 7.3-2 Copy slow on Alpha

Hi Volker,
We are using DECnet-Plus.
NodeA: LANCP shows no errors on EWA0 (DE500-BA), 4x unrecognised multicast packets. EWB0 (not configured for DECnet!) shows last error at 18:07 today and 3105 carrier check failures?
NodeB: LANCP shows no errors on EWA0 (DE500-BA). EWB0 (not configured for DECnet!) shows last error at 18:03 today and 300 carrier check failures?

Sorry, what are the xxx and yyy you refer to in the MC NCL check?

I have not used DTSEND before. What do I have to set up?

Our network guys say there are no lost packets on the net.

Thanks in advance
marsh_1
Honored Contributor

Re: VMS 7.3-2 Copy slow on Alpha

Volker Halle
Honored Contributor

Re: VMS 7.3-2 Copy slow on Alpha

John,

to be able to run DTSEND to a remote node, you have to set up the DTR session control application with a valid username. Use

$ MC NCL SET SESSION CONTROL APPLICATION DTR USER NAME = "MIRRO$SERVER"

This assumes that this username exists in the UAF. It should, because it is the default for the MIRROR session control application.

Then invoke:

$ MC DTSEND
_Test: data/node=

DTSEND will send as many DECnet packets as possible to the remote node for 30 seconds and then report:

%NET-S-NORMAL, normal successful completion

Test Parameters:
  Test duration (sec)    30
  Target node            "I64VMS"
  Line speed (baud)      1000000
  Message size (bytes)   128

Summary statistics:
  Total messages         XMIT 141600   RECV 0
  Total bytes            XMIT 18124800
  Messages per second    4720.00
  Bytes per second       604160
  Line thruput (baud)    4833280
  %Line Utilization      483.328

This is the easiest way to test DECnet performance between 2 nodes in the network. You can also increase the packet size and other parameters, see the documentation.

If you get inconsistent results for messages per second between multiple runs, like 4720 and then 50 the next time, it's time to look deeper...
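For anyone who wants to sanity-check a DTSEND report, the summary lines in the sample output can be recomputed from the raw test parameters. This is a minimal sketch: the formulas are inferred from the sample figures above (they reproduce them exactly), not taken from the DTSEND documentation, and the function and key names are made up for illustration.

```python
# Recompute the DTSEND summary from the raw numbers in the sample output.
# ASSUMPTION: the formulas are inferred from the sample, not from DTSEND documentation.

def dtsend_summary(total_messages: int, message_size: int,
                   duration_s: int, line_speed_baud: int) -> dict:
    messages_per_s = total_messages / duration_s
    bytes_per_s = messages_per_s * message_size
    throughput_baud = bytes_per_s * 8  # 8 bits per byte
    utilization_pct = 100.0 * throughput_baud / line_speed_baud
    return {
        "messages_per_s": messages_per_s,
        "bytes_per_s": bytes_per_s,
        "throughput_baud": throughput_baud,
        "utilization_pct": utilization_pct,
    }


if __name__ == "__main__":
    # Values from the sample run: 141600 messages of 128 bytes in 30 seconds,
    # reported line speed 1000000 baud.
    for name, value in dtsend_summary(141600, 128, 30, 1_000_000).items():
        print(f"{name:18s} {value}")
```

The utilization above 100% simply follows from dividing by the reported line speed of 1000000 baud, which is evidently not the real speed of the link in that sample.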

In DECnet-IV you could easily see the node counters with MC NCP SHOW NODE COUNTERS. In DECnet-OSI, it is harder. DTSEND could be using NSP or OSI transport (or even DECnet-over-IP). To look at the 'node counters', you need to find out the local NSAP and remote NSAP addresses and the transport protocol being used (NSAP = Network Service Access Point).

The 'easy way out' would be to use $ MCR NET$MGMT (requires an X11 display), then click on Tasks -> Show Known Node Counters.

Finding the local and remote NSAP addresses goes like this:

$ MC NCL SHOW NSP LOCAL NSAP * ALL
$ MC NCL SHOW OSI TRANSPORT LOCAL NSAP *

Repeat these commands on the remote node, then check the node counters:

$ MC NCL SHOW NSP LOCAL NSAP xxx REMOTE NSAP yyy ALL COUNTERS

and the same for OSI TRANSPORT instead of NSP.

Volker.
Robert Gezelter
Honored Contributor

Re: VMS 7.3-2 Copy slow on Alpha

John,

As has happened in another thread, please check the DUPLEX SETTINGS on ALL equipment. Severe slowdowns (carrier check errors are one common indicator) are frequently caused by a duplex mismatch somewhere in the network path.

- Bob Gezelter, http://www.rlgsc.com
marsh_1
Honored Contributor

Re: VMS 7.3-2 Copy slow on Alpha

john,

have you tried hard-setting the NIC's speed? i know that auto-neg is supposed to have fewer issues these days, but ...


marsh_1
Honored Contributor

Re: VMS 7.3-2 Copy slow on Alpha

personal choice - would start with 100 half duplex.

fwiw

John_TT
Advisor

Re: VMS 7.3-2 Copy slow on Alpha

Hi,

These 2 commands work fine on all systems:
$ MC NCL SHOW NSP LOCAL NSAP * ALL
$ MC NCL SHOW OSI TRANSPORT LOCAL NSAP *

This command works OK for NSP but fails with a "command failed no such object instance" message for OSI TRANSPORT:
$ MC NCL SHOW NSP LOCAL NSAP xxx REMOTE NSAP yyy ALL COUNTERS

The counters look OK no errors. DTsend also came back with consistent results on both test nodes with several tries.

As I mentioned, I looked for duplex problems at first. I have tried setting the NICs, at the console level, to FastFD and the switch to 100 Mb full duplex, with no change. The test systems are connected to an HP ProCurve switch, all set to auto-negotiate. The results are the same with a crossover cable and no switch.
Volker Halle
Honored Contributor

Re: VMS 7.3-2 Copy slow on Alpha

John,

so you apparently have not configured OSI TRANSPORT, that's fine.

Are you saying that the 'Duplicate PDUs Received' and 'Retransmitted PDUs' are ZERO for all remote NSAPs listed ? Can you identify the REMOTE NSAP for NodeB (use a NCL SHOW NSP LOCAL NSAP * on NodeB) and just watch those counters. The PDUs Received and PDUs Sent counters increase, while you run your DTSEND test between NodeA and NodeB, right ?

If your 25 MB file took 6 seconds to be transferred in the past, this would be a good 33 Mbit/sec utilization of a 100 Mbit network. You cannot argue that the network is that much 'slower' now, especially if DTSEND also shows good throughput. You could use /SIZE=512 to obtain a good approximation of the possible network bandwidth for a file transfer. You could also add /TYPE=ECHO, so that massive amounts of data flow in BOTH directions.
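The throughput figure can be worked out as follows. This is a minimal sketch, assuming 512-byte disk blocks and decimal megabits; the function name is made up for illustration. With exactly 50000 blocks the result comes out slightly above the rounded 33 Mbit/sec quoted above.

```python
BLOCK_BYTES = 512  # size of one OpenVMS disk block


def throughput_mbit_s(blocks: int, seconds: float) -> float:
    """Average throughput in Mbit/s (decimal) for a file of `blocks` disk blocks."""
    return blocks * BLOCK_BYTES * 8 / seconds / 1_000_000


if __name__ == "__main__":
    fast = throughput_mbit_s(50_000, 6)        # the copy as it used to behave
    slow = throughput_mbit_s(50_000, 24 * 60)  # the slow copy, about 24 minutes
    print(f"fast copy: {fast:.1f} Mbit/s")     # fast copy: 34.1 Mbit/s
    print(f"slow copy: {slow:.3f} Mbit/s")     # slow copy: 0.142 Mbit/s
```

So the slow copy runs at well under 1% of the rate that the same pair of systems achieved before.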

Use MONITOR DECNET on the local and remote node while testing with DTSEND, so you can get an estimate of the achievable packet rate. What's the rate during file copy ?

If all the DTSEND tests show acceptable performance and copy doesn't, you have to think about where the bottleneck may be. Can you test a $ copy local_file.txt nodeB::*.* on NodeB itself ? In that case, the data would not travel over the physical network at all. What performance do you get then ?

Volker.
Hein van den Heuvel
Honored Contributor

Re: VMS 7.3-2 Copy slow on Alpha

Volker>> If all the DTSEND tests show acceptable performance and copy doesn't, you have to think about where the bottleneck may be.

When I first read this problem, I did a back-of-the-envelope calculation of how slow the disk would have to be if there were a filesystem/disk problem involved.

Well, in a worst case of 1-block IOs, that would be 50,000 / ( 24*60 ) = 34 IO/sec. But 1-block IOs are unlikely. Maybe 16-block IOs with a 32-block extend every 16 blocks? That would be 5+ IOs for every 32 blocks = 8000 IOs, which would be only 5 IO/sec. Too easy.
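The same back-of-the-envelope numbers, spelled out. This is a minimal sketch: the function name is made up, and the "5 IOs per 32 blocks" pattern is the guess from the paragraph above, not a measured I/O profile.

```python
def io_estimate(total_blocks: int, blocks_per_group: int,
                ios_per_group: int, elapsed_s: float) -> tuple:
    """Return (total IOs, IOs per second) for a copy that issues `ios_per_group`
    IOs for every `blocks_per_group` blocks and takes `elapsed_s` seconds."""
    total_ios = total_blocks / blocks_per_group * ios_per_group
    return total_ios, total_ios / elapsed_s


if __name__ == "__main__":
    elapsed = 24 * 60  # seconds, the slow copy

    # Worst case: one IO per block.
    ios, rate = io_estimate(50_000, 1, 1, elapsed)
    print(f"1-block IOs:     {ios:.0f} IOs, {rate:.1f} IO/sec")  # 50000 IOs, 34.7 IO/sec

    # Guess from the text: about 5 IOs for every 32 blocks.
    ios, rate = io_estimate(50_000, 32, 5, elapsed)
    print(f"5 IOs/32 blocks: {ios:.0f} IOs, {rate:.1f} IO/sec")  # 7812 IOs, 5.4 IO/sec
```

Either rate is far below what even a slow disk can sustain, which is why a pure disk bottleneck looks unlikely.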

Still... best check!
RMS Settings: (EXTEND, SEQ BLOCKS, NETWORK BLOCKS)
$SHOW RMS

and file fragmentation on the OUTPUT side after the copy:
$DUMP/HEAD/BLO=COUNT=0 slow_output.dat

Or perhaps even use an LD device and TRACE all the IOs (or use the XFC trace)

If still desperate for an explanation, then defining FAL$LOG to 17 (or whatever value gets the data messages logged) may be needed. The initial FAL capabilities negotiation may also be interesting.

Good luck!
Hein

John_TT
Advisor

Re: VMS 7.3-2 Copy slow on Alpha

The local copy on nodeB (nodeB$ copy fileA.txt nodeB::fileA.txt;) works fine. Also, if I stop and restart DECnet, the first copy is fast and then all the others are slow.

With dtsend running on both nodes /block=512 /type=echo, monitor decnet shows:
NodeA (DS10)
Arriving local packet rate max 13124 average 6206
Departing local packet rate 6206 average 3089

NodeB (XP900)
Arriving local packet rate max 6271 average 671
Departing local packet rate 13223 average 1459
John_TT
Advisor

Re: VMS 7.3-2 Copy slow on Alpha

... Now I have the copy from nodeB (XP900) to NodeA (DS10) working as expected, copy both ways fast. But nodeA to nodeB is still slow... Maybe that is to be expected after the DECnet stats. What changed?
Volker Halle
Honored Contributor

Re: VMS 7.3-2 Copy slow on Alpha

John,

if you run one DTSEND test and MONITOR DECNET on both nodes at the same time, the arriving/departing packet average values should match each other - if there is no additional DECnet traffic at that time. The counters should also match the messages per second reported by DTSEND. You can increase the test interval with /MINUTES=5 to achieve a stable packet rate for monitoring purposes.

What does MONITOR DECNET tell you if you perform a 'slow copy' ? Is the packet rate constantly slow, or is it just ZERO for a long time and then jumps to normal ?

Did you verify the 'Duplicate PDUs Received' and 'Retransmitted PDUs' counters between those 2 nodes in case of the 'slow copy' ?

Volker.
John_TT
Advisor

Re: VMS 7.3-2 Copy slow on Alpha


'Duplicate PDUs Received' and 'Retransmitted PDUs' counters between those 2 nodes in case of the 'slow copy'-- Both = 1 on NodeB and go up to 2 each on nodeA during slow copy.

Monitor decnet during slow copy is constantly slow, rx 4 tx 16 and vice versa on the other node.

Running 1 DTSEND, the arriving/departing packet average values don't match: approx. 1:2, as with the MONITOR DECNET stats. They don't match the DTSEND numbers either.
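As a rough cross-check of the slow-copy packet rate, here is a minimal sketch. It assumes the MONITOR DECNET figures are packets per second, that the 16 packets/sec direction carries the file data, and that each data packet carries about 1024 bytes of file data. The payload size is a guess, not a measured value, and the function name is made up for illustration.

```python
BLOCK_BYTES = 512  # size of one OpenVMS disk block


def estimated_copy_seconds(blocks: int, packets_per_s: float, payload_bytes: int) -> float:
    """Time to move `blocks` disk blocks at a steady packet rate.
    ASSUMPTION: every packet carries `payload_bytes` bytes of file data."""
    return blocks * BLOCK_BYTES / (packets_per_s * payload_bytes)


if __name__ == "__main__":
    seconds = estimated_copy_seconds(50_000, 16, 1024)
    print(f"{seconds:.1f} s = {seconds / 60:.1f} min")  # 1562.5 s = 26.0 min
```

With that assumed payload, a steady 16 packets/sec lands in the same range as the 23-24 minutes actually observed, which fits a copy that is constantly slow rather than stalling and resuming.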
Volker Halle
Honored Contributor

Re: VMS 7.3-2 Copy slow on Alpha

John,


NodeB (XP900)
Arriving local packet rate max 6271 average 671


Is '671' a typo ?

When I run DTSEND DATA/NODE=xxx/SIZE=512/TYPE=ECHO/MIN=2, I see consistent arriving/departing local packet average rates on both systems with MONITOR DECNET. The systems are idle except for the DTSEND traffic. The departing rate on one node is about the same as the arriving rate on the other node (difference in values < 5%).
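The 5% criterion can be written down as a small check. This is a minimal sketch: the function name and the choice to measure the difference relative to the larger rate are assumptions made for illustration. The sample figures are the MONITOR DECNET averages posted earlier in this thread; they were taken with DTSEND running on both nodes, so they are only indicative.

```python
def rates_match(departing: float, arriving: float, tolerance: float = 0.05) -> bool:
    """True if two average packet rates differ by less than `tolerance`,
    measured relative to the larger of the two."""
    larger = max(departing, arriving)
    if larger == 0:
        return True  # both links idle
    return abs(departing - arriving) / larger < tolerance


if __name__ == "__main__":
    # Averages reported earlier in the thread.
    print(rates_match(departing=3089, arriving=671))   # NodeA -> NodeB: False
    print(rates_match(departing=1459, arriving=6206))  # NodeB -> NodeA: False
```

On a healthy, otherwise idle link both checks should come back True.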

Did you compare the DECnet-OSI parameters for the various entities ?

Any problems with nonpaged pool (SHOW MEM/POOL/FULL) ?

You can trace DECnet traffic with the CTF$TRACE utility, e.g.

$ TRACE START[/LIVE] "ROUTING CIRCUIT csmacd-0"

Try this for a couple of seconds with a 'fast' and a 'slow' copy. Then compare the results. TRACE ANAL can analyze the traces and output ASCII text files.

Volker.
Volker Halle
Honored Contributor

Re: VMS 7.3-2 Copy slow on Alpha

John,

NodeB (XP900)
Arriving local packet rate max 6271 average 671
Departing local packet rate 13223 average 1459

Assuming this data is correct and was taken while continuously running DTSEND to or from this node: if you now run MONITOR DECNET/INT=1 on this node (logged in via TCPIP or LAT), is the packet rate constant or does it drop to ZERO from time to time ? That would then - again - indicate a problem with lost packets.

Volker.
John_TT
Advisor

Re: VMS 7.3-2 Copy slow on Alpha

Hi Volker,

The numbers are as I stated.

Can't find anything on CTF$TRACE ?

The monitor decnet stats are constantly low for the slow copy.

No pool problems.

I haven't checked the DECnet-OSI parameters for the various entities yet.
Volker Halle
Honored Contributor

Re: VMS 7.3-2 Copy slow on Alpha

John,

$ TRACE HELP

or see the 'DECnet/OSI for VMS CTF use' manual on

http://h71000.www7.hp.com/doc/decnetplus83.html

So is only NodeB (XP900) showing 'the problem' then ? DTSEND rates on that node (arriving packet rate) seem to be 10x slower than on the DS10.

You may need to provide more detailed test results. Run one DTSEND operation at a time, and specify the parameters, where it is being run and the message rate achieved. Also provide MONITOR DECNET data from both nodes involved in the test. Build a table from the data from these tests and find out exactly WHICH operation causes unexpectedly bad performance.

Compare DECnet-OSI parameters for:

NCL> SHOW NSP ALL
NCL> SHOW SESSION CONTROL ALL
NCL> SHOW ROUTING ALL
NCL> SHOW ROUTING CIRCUIT * ALL

Volker.
marsh_1
Honored Contributor

Re: VMS 7.3-2 Copy slow on Alpha

volker,

does DECnet-OSI have equivalents of the buffer quota and pipeline quota from Phase IV ?

Volker Halle
Honored Contributor

Re: VMS 7.3-2 Copy slow on Alpha

Mark,

yes, it's called NSP MAXIMUM WINDOW and MAXIMUM RECEIVE BUFFERS.

That's why I asked for checking of all the parameters for the various DECnet-OSI entities.

Volker.
Volker Halle
Honored Contributor

Re: VMS 7.3-2 Copy slow on Alpha

John,

did you ever try to reproduce this behaviour with COPY/FTP ?

Volker.
John_TT
Advisor

Re: VMS 7.3-2 Copy slow on Alpha

Hi, I finally woke up and started the trace, nodeA_trace.log and nodeB_trace.log attached. Also attached are nodeA.log and nodeB.log with the NCL OSI parameters (I would have posted them sooner but the copy took a while...). I ran a slow then fast copy during trace.

I have 2 systems on test with only Decnet running, nothing else is started, but copy /ftp worked fast.
Robert Gezelter
Honored Contributor

Re: VMS 7.3-2 Copy slow on Alpha

John,

If I understand your last posting correctly, the COPY over DECnet performs poorly, but the equivalent FTP transfer performs in line with expectations.

If the connections are TRULY through the same path (something to be confirmed with a LAN analyzer, e.g. Wireshark -- NOT with displayed settings -- in this case I prefer hard observable data where possible), that excludes the network.

Quotas associated with DECnet are a possibility, as has been already discussed. Also, please check the RMS parameters (SHOW RMS) from within the NETWORK receiver process. I can imagine a situation where disk fragmentation, and the resulting ongoing extends, create a bottleneck in transfer performance.

As an experiment, try increasing the RMS EXTEND, BUFFER, and BLOCK parameters in the LOGIN.COM file for the server account (say EXTEND=2000, BUFFER=128, and BLOCK=127; it may also be necessary to increase the account's page-related quotas and working set). See what happens.

- Bob Gezelter, http://www.rlgsc.com
John_TT
Advisor

Re: VMS 7.3-2 Copy slow on Alpha


Hi Bob, the SHOW RMS command from the account I am using shows all zeros except for:
System multi block count = 32 and System network block count = 127 (on both systems).