Very bad Performance over native DECnet
08-31-2006 02:56 AM
We found a very strange behaviour during DECnet copy operations.
I attached an Excel sheet with the results of my measurements with DTSEND.
We are generally using DECnet over IP.
The two nodes ABC001 and ABC002 are cluster members. So if we copy something from one of those nodes to the other one, native DECnet is in use. In all other cases DECnet over IP is in use.
If I use DTSEND from ABC001 to ABC002, the performance is very, very bad. If I do the same test using the SCSSYSTEMID instead of the node name, the performance increases, but is still not good enough.
If I do the same test, but from ABC002 to ABC001, the performance is much better.
The nodes ABC001, ABC002 and DEF002 are in the same IP subnet (so all tests marked with N/A are not possible, because we do not have any DECnet routers).
Does somebody have an idea where the problem is?
We already checked the tower information with MC DECNET_REGISTER, and we flushed the cache (mc ncl flush sess contr nam cache entr "*").
The measuring results are not really reproducible, because we have a lot of DECnet traffic during working hours. So if I do the same test multiple times, the results are different for each measurement.
I will try to measure again tonight, and I hope that we will have more and better reproducible figures.
Thanks in advance
Heinz
Solved! Go to Solution.
08-31-2006 03:36 AM
Re: Very bad Performance over native DECnet
I have had many situations where the underlying problem was a duplex mismatch somewhere in the network.
What made it seem to appear randomly was the question of other traffic on the network.
Another possibility is collisions somewhere in the network.
As a start, I would review the error counters along the path that the DECnet traffic is taking (note that this may be different from the path it takes when routed as DECnet over IP).
- Bob Gezelter, http://www.rlgsc.com
08-31-2006 03:50 AM
Re: Very bad Performance over native DECnet
You can check the counters and the speed/duplex settings in LANCP:
$ MC LANCP
LANCP> SHOW DEVICE/COUNTERS
LANCP> SHOW DEVICE/CHARACTERISTICS
Recent versions of VMS have reportedly improved autonegotiation; however, I still hard-set both the server and the switch.
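A minimal sketch of hard-setting the speed and duplex from LANCP (EWA0 is a placeholder device name, and availability of these qualifiers depends on the adapter and VMS version). SET changes the running device; DEFINE makes the setting permanent across reboots:
$ MC LANCP
LANCP> SET DEVICE EWA0/SPEED=100/FULL_DUPLEX
LANCP> DEFINE DEVICE EWA0/SPEED=100/FULL_DUPLEX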
Andy
08-31-2006 04:30 AM
Re: Very bad Performance over native DECnet
My first guess is: lost packets.
The DECnet re-transmission timing is very poor compared with other network protocols. Once a DECnet packet is lost, it takes a long time until that packet gets re-transmitted. For COPY or DTSEND, this looks like 'bad performance', but in reality, nothing is being transmitted during the timeout period.
With DECnet Phase IV, you could easily do MC NCP SHOW NODE destination-node COUNTERS from the source node and you would look for 'Response Timeouts'.
With DECnet-Plus, it gets a little tricky. The easiest way is to use MCR NET$MGMT (needs DECwindows display) and look at Tasks -> Show Known Node Counters and look for Retransmitted PDUs to the destination node of your DTSEND test.
You can also drill down on the (NSP or OSI) transport -> local NSAP -> remote NSAP -> NSAP address of dest - then Actions -> Zoom will show you the counters. Look for 'Retransmitted PDUs' and/or 'Duplicate PDUs received'.
Another way to verify whether you are losing packets would be to run MONI DECNET or MONI PROC/TOPBIO while DTSEND is running. If you do not see a constant rate of IOs, the chances are high that you're losing packets in the network path and have to wait for re-transmissions.
Once you confirm that this is the reason for the perceived 'bad performance', then comes the interesting part: trying to find out where the packets get lost.
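For example (illustrative only; pick the interval to taste):
$ MONITOR DECNET/INTERVAL=1
$ MONITOR PROCESSES/TOPBIO/INTERVAL=1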
Volker.
08-31-2006 05:20 AM
Re: Very bad Performance over native DECnet
The first thing we did was to check the counters in LANCP. There is no problem like collisions, frame check errors or anything like that. All the counters are 0, except the counters for packets/bytes received/sent.
We are using Cisco switches, and the ports our machines are connected to are set to 100M full duplex. The interfaces are also set to 100M full duplex, done in console mode.
Each of the two problem machines has 2 dual NICs. We configured FailSafe IP, and all 4 lines are configured for DECnet.
I think this is not a hardware issue, because, as you can see in the Excel sheet, some connections from remote machines using DECnet over IP are as fast as expected. So we (me and the Swiss OpenVMS Ambassador) think that this is a problem with name resolution, lost packets, towers or something like that. (But what?)
I compared the NCL scripts in SYS$SPECIFIC of the two machines. They are identical, except for the addresses.
Our good luck is that these are two test machines (GS1280). But in our case, 'test machine' means that there is a test team (40 people) and a development crew (80 people). For us in system management, those machines are like production machines, because we have to announce changes many days before we make them. Even during the night, we can't do something like a reboot there without prior announcement.
This afternoon we started to look at the CDI caches; we tried to use SYS$UPDATE:DECNET_MIGRATE (SHOW PATH TO LOCAL:.nodename), but we don't yet have any conclusive results.
I think we will start to follow the instructions of Volker, but I can't do it before tomorrow.
... but still any input is very welcome.
Heinz
08-31-2006 05:40 AM
Re: Very bad Performance over native DECnet
What you can do now is:
NCL> SHOW NSP LOCAL NSAP *
(note the local NSAP address)
NCL> SHOW NSP LOCAL NSAP local_nsap REMOTE NSAP * retransmitted pdus, duplicate pdus received
Repeat the same for OSI TRANSPORT ...
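The OSI transport equivalent would look like this (a sketch, assuming the same entity structure as NSP):
NCL> SHOW OSI TRANSPORT LOCAL NSAP *
NCL> SHOW OSI TRANSPORT LOCAL NSAP local_nsap REMOTE NSAP * RETRANSMITTED PDUS, DUPLICATE PDUS RECEIVED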
If all those counters are 0, you can forget about my theory. If not, we'll see...
Volker.
08-31-2006 12:11 PM
Re: Very bad Performance over native DECnet
With a duplex mismatch you can see anywhere from 7% to 20% packet loss.
If you have both NSP and OSI transports enabled on a node, but only DECnet over IP working between the nodes, then you will get a 30-second delay at the beginning, as DECnet tries the native transport first, times out, and then tries DECnet over IP. We have a 6-node cluster with 3 nodes on one subnet and 3 on the other. Within a subnet NSP works, but between the subnets only DECnet over IP works. We had to remove the nodes on the other subnet from DECNET_REGISTER so that it would only try DECnet over IP.
You may need to check DECNET_REGISTER to make sure the address data is correct.
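A sketch of how to inspect what is registered (the node name is a placeholder, and exact command forms vary between versions — check HELP inside the utility):
$ RUN SYS$SYSTEM:DECNET_REGISTER
decnet_register> show node ABC002 full
decnet_register> exit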
08-31-2006 12:22 PM
Re: Very bad Performance over native DECnet
We had similar problems, but with Tru64. Setting 100M full duplex at the console doesn't work for Tru64, and may still be an issue with OVMS.
Try FTPing a very large file from each machine to NLA0:[000000]. If you have the network set up at 100M full duplex across the switches and hosts, then TCP/IP will transfer at over 9 MB/sec. However, if you find only one system is getting that and the other is getting much, much less, then you know that one system is probably running at half duplex.
In T64 you have to force it at the OS level as the console setting of the duplex is ignored.
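A minimal sketch of such a test (the node and file names are placeholders; NLA0: is the null device, so the transfer measures the network path without disk I/O on the receiving side):
$ FTP ABC002
FTP> PUT BIGFILE.DAT NLA0:[000000]BIGFILE.DAT
FTP> EXIT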
Robert.
08-31-2006 06:29 PM
Re: Very bad Performance over native DECnet
If I do the same test using the SCSSYSTEMID instead of the node name, the performance increases, but is still not good enough.
What do you mean by this? What is the difference between using the SCSSYSTEMID or the node name for the DTSEND test? Is it selecting NSP vs. OSI transport? Do the node name and the SCSSYSTEMID differ?
What do the numbers in your spreadsheet mean?
Your DTSEND is sending large packets in one direction and small ones in the other. This may make a difference. You can specify /TYPE=ECHO to have DTR send back the whole packet.
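For example (a sketch only — /TYPE=ECHO is from the text above, but the other qualifiers are from memory, so check the DTSEND HELP for your version's exact syntax; ABC002 is a placeholder):
$ MCR DTSEND/TEST=DATA/NODENAME=ABC002/TYPE=ECHO/SECONDS=30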
Volker.
08-31-2006 07:26 PM
Re: Very bad Performance over native DECnet
- By default DECnet uses larger packets, which can lead to packet loss on a flaky network.
- DECnet uses OSI IS-IS routing. There can be an old router (DECxyz) buried in the network somewhere that may only have a 10Mb link.
- The DNS name lookup is different. This can cause slow link establishment as the DNS lookup list times out.
I found the easiest way out is to force DECnet over IP by fiddling the address towers with DECNETREG, or just removing the node from DECNETREG. I think there are better ways using NCL if you have the time. Don't forget to do an NCL FLUSH of the cache... Also, the back translation for proxy access may change.
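The flush command in question, as used elsewhere in this thread:
$ MCR NCL FLUSH SESSION CONTROL NAMING CACHE ENTRY "*"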
Tim
08-31-2006 08:15 PM
Solution
It's worth working through all the layers involved. At the hardware layer, do you have throughput issues with any other protocol on the same adapter (SCS, TCPIP etc.)? If not, and if all the counters look OK (both in the switch and on the systems), then let's assume that the hardware layer is probably OK.
Next layer would be the virtual interface layer if you're using LAN failover. LANCP should show you the information you need.
I'm wondering if it's the DECnet transport layer. I generally set up DECnet-Plus to use either NSP (my preference) or OSI transport. I suggest that you modify the address tower information with DECNET_REGISTER and ensure that the target local (native DECnet) nodes only have a single address tower entry using NSP transport. Don't forget the infamous MCR NCL FLUSH SESSION CONTROL NAMING CACHE ENTRY "*" afterwards.
You can try flipping over to the OSI transport layer later if you wish, but I have (a few years ago) seen some odd performance issues with OSI transport and using BACKUP over DECdfs - that was related to large packets. Back then I found using NSP would be fine and using OSI TRANSPORT wasn't.
Setting up address tower entries for both transports will generally double up all the timeouts as the communications path will try one, then the other.
I usually remove the IP naming entries and have the IP name resolution provided by the local hosts file by setting the DOMAIN name server to 127.0.0.1 in NET$CONFIGURE.
Some of the stuff in here might help: http://h71000.www7.hp.com/openvms/journal/v5/index.html#decnet
It's difficult to guess what to try without being there and seeing it! Good luck.
Cheers, Colin.
08-31-2006 08:58 PM
Re: Very bad Performance over native DECnet
As recommended by Volker, I used NET$MGMT to look at 'Retransmitted PDUs' and 'Duplicate PDUs':
Duplicate PDUs 31714
Retransmitted PDUs 2468
The counters are increasing during a DTSEND test.
Monitor PROCESS/TOPBIO shows values between 200 and 2000 and is never anywhere near constant.
Monitor DECnet displays values between 1 and 6000
The output from
$ MC NCL show nsp local nsap 39756F11510031AA0004007EC520 -
remote nsap 39756F11510031AA0004007AC520 all
looks as follows:
Identifiers
Name = 39756F11510031AA0004007AC520
Status
NSAP Address = 39:756:11-51-00-31:AA-00-04-00-7A-C5:20 (LOCAL:.GDC140)
UID = 3C15766D-37A3-11DB-87F0-001321081234
Counters
Creation Time = 2006-08-29-19:13:36.480+00:00I0.150
Remote Protocol Errors = 4
Total Octets Received = 2078857053
Total Octets Sent = 1514367784
PDUs Received = 2392972
PDUs Sent = 2046628
Duplicate PDUs Received = 2650
Retransmitted PDUs = 271
Connects Received = 38
Connects Sent = 14
Rejects Received = 0
Rejects Sent = 0
User PDUs Discarded = 0
User Octets Received = 2054571743
User Octets Sent = 1496421003
User PDUs Received = 1589613
User PDUs Sent = 1093773
If I look at OSI transport, I can't find any NSAP which shows duplicate PDUs received or retransmitted PDUs.
So far I think Volker is right; it seems that we are losing packets.
I also tried to follow the instructions of Cass. I pinged the other node as follows:
$ tcpip ping gdc141/number=100/packet_size=10000
----gdc141 PING Statistics----
100 packets transmitted, 100 packets received, 0% packet loss
round-trip (ms) min/avg/max = 2/2/3 ms
It seems that we don't have packet loss over TCP/IP.
To Robert: I think we don't have the problem that the switch port settings do not correspond with the NIC settings.
We had this problem a long time ago. Anyway, I will let the network guys check the switch ports.
To Volker: The numbers in my spreadsheet are the line utilization displayed by DTSEND.
Any ideas how to continue ?
Regards
Heinz
08-31-2006 09:38 PM
Re: Very bad Performance over native DECnet
You didn't answer my question about the meaning of using 'systemid' and 'nodename' for the DTSEND test.
You said that both nodes have 4 LAN interfaces, all configured for DECnet. When using IP (PING test), you may be using only a subset of those interfaces. When using DECnet, they may all be used (round-robin?!).
Are these 4 network interfaces connected to the same LAN segment or to different LAN segments? Do they all work, i.e. does a packet sent out via one of those interfaces to the other node really get received by the other node?
Individual NCL LOOP tests via those 4 local LAN interfaces to each of the 4 remote LAN interfaces may tell you more.
MC NCL LOOP MOP CIRC csmacd-n ADDRESS aa-bb-cc-dd-ee-ff
Issue those tests for all CSMACD circuits (n = 0...3) to each of the 4 remote LAN interfaces.
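A hypothetical DCL wrapper for one pass of those tests (the MAC address and circuit names are placeholders; substitute each remote interface's address in turn):
$ count = 0
$ loop:
$ circuit = "CSMACD-''count'"
$ write sys$output "Testing ''circuit'"
$ mc ncl loop mop circuit 'circuit' address AA-BB-CC-DD-EE-FF
$ count = count + 1
$ if count .le. 3 then goto loop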
Volker.
08-31-2006 09:56 PM
Re: Very bad Performance over native DECnet
Monitor PROCESS/TOPBIO shows values between 200 and 2000 and is never anywhere near constant.
Monitor DECnet displays values between 1 and 6000
When I run a DTSEND test (using the 30 sec default time) between 2 systems on our (more or less empty) LAN, I get pretty constant packets/sec rates shown by MONI DECNET. You may want to verify this using a pair of nodes between which you're seeing acceptable performance (and no lost packets). MONI PROC/TOPBIO should be an even better indicator, as that will not count other DECnet traffic.
If MONITOR DECnet shows 6000 packets/sec, this is the rate you should get during DTSEND tests. DTSEND sends the packets as fast as possible, but I'm sure it does not keep a number of QIOs outstanding (the buffered I/O count does not vary) - so it sends them one-by-one after receiving the response from the remote end.
Volker.
09-01-2006 12:15 AM
Re: Very bad Performance over native DECnet
We were using the SCSSYSTEMID to be sure that NSP is in use, and we found a difference, as you can see in the Excel sheet.
The interfaces are 2 dual DE600s. From each DE600, one cable is connected to switch1 and one cable is connected to switch2. I don't have knowledge of the network topology, because the network management is in the hands of a partner company. But all 4 interfaces of both machines are in the same subnet.
We opened a call with the network management to check the switch ports for errors and to verify that all ports are set to 100M full duplex.
I will now start to do further tests:
1. I will try the NCL LOOP commands to find out if only one interface has the problems.
2. I will ensure that only the NSP tower is defined in DECNET_REGISTER and do the tests.
3. I will ensure that only the OSI tower is defined in DECNET_REGISTER.
Regards
Heinz
09-01-2006 03:57 AM
Re: Very bad Performance over native DECnet
I did more tests with DTSEND between the two cluster members where we have our problem. I documented my measurements in the attached Excel sheet.
It seems that the problem is the OSI transport.
If I define only the NSP tower, everything looks good.
If I define only the TP4 tower, the results are bad ... see the Excel sheet.
If I define both towers, the results are bad.
If I use a node in another subnet, it will use DECnet over IP. The results are good.
If I use FTP to copy a 10 MB file, I cannot find any speed difference on any node I tested. There is no difference whether the machines are in the same subnet or not (node xyz is located at a remote site, 15 km away and in another subnet).
So TCPIP seems not to be the problem, and DECnet NSP transport also seems to be OK.
OSI transport is the problem.
Regards
Heinz
09-01-2006 06:35 PM
Re: Very bad Performance over native DECnet
I still see a couple of open questions:
- You were seeing retransmitted PDUs with NSP between the 2 nodes, yet the DTSEND performance seems not to be as bad as with OSI transport.
- Would NSP transport use the 4 routing circuits equally? Or would only OSI transport do that?
- Did you test the 4 DECnet circuits between the 2 nodes?
Volker.
09-01-2006 09:53 PM
Re: Very bad Performance over native DECnet
You are right, the problem is not completely solved.
NSP will not use the 4 routing circuits equally, but most of our network traffic will use DECnet over IP, and this way we use all 4 interfaces.
I will continue analyzing why we have those retransmitted PDUs on Monday.
Regards
Heinz
09-01-2006 11:47 PM
Re: Very bad Performance over native DECnet
That's useful progress. It confirms something I've seen before. I suggest that you log a support call to try and get it fixed. I'm assuming that you're running the 'latest' version with all the current ECOs, in which case the problem is still in there and it still shows up occasionally.
Load balancing across all available adapters should happen with the NSP transport layer, but there are several things you need to consider. I'll assume that you're using Phase IV compatible addressing and that the address tower entries (yes you can have more than one if you have both Phase IV and Phase V style addresses) for NSP transport refer to the Phase IV compatible address, not a Phase V address.
Are the 4 LAN adapters connected to entirely separate LANs or VLANs where the only interconnection between them is by routing? If so, then you can enable a Phase IV style address on all of the adapters. Given similar adapter performance and identical 'path costs' (to borrow from Phase IV land), I'd expect to see the OSI End Systems load balance across all available adapters.
If however the adapters are connected to the same LAN or VLAN then you can only have a Phase IV style address on one adapter in each LAN / VLAN because of the risk of a duplicate MAC address. Depending how the address tower entries are defined that may implicitly limit the number of adapters across which the routing layer will load balance the traffic.
It's all jolly good fun, isn't it?
Cheers, Colin.
09-10-2006 02:50 AM
Re: Very bad Performance over native DECnet
The four NICs are all connected to different switches, but all on the same subnet.
In the same subnet we have another 2-node ES45 cluster. We did the same tests on this cluster too, but we could not see the problem we have on our problem machines.
The network guys have been involved for a week now, but they could not yet find the problem and are still working on it.
We don't have any DECnet routers. All DECnet traffic goes over DECnet over IP, except DECnet traffic from one cluster member node to the other.
This is just a short update. We are still working on this problem, but with less priority.
Regards
Heinz
09-10-2006 03:23 AM
Re: Very bad Performance over native DECnet
The 4 DECnet routing circuits may be used equally by DECnet-OSI for outgoing messages. If one of them had a problem, how would you detect it?
From the description of your DECnet network usage, couldn't you stop 3 of these 4 DECnet routing circuits at a time and test them one by one?
Redundancy may be very nice, but sometimes it is quite counterproductive when it comes to troubleshooting. And there is always the problem of determining whether all redundant paths are actually working.
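A sketch of how that one-by-one test might look (the circuit names are placeholders — check NCL SHOW ROUTING CIRCUIT * for the real names, and don't forget to re-enable the circuits afterwards):
$ MC NCL
NCL> DISABLE ROUTING CIRCUIT CSMACD-1
NCL> DISABLE ROUTING CIRCUIT CSMACD-2
NCL> DISABLE ROUTING CIRCUIT CSMACD-3
(run the DTSEND test over the remaining circuit, then ENABLE the circuits again and repeat with a different one left active)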
Volker.
09-12-2006 12:35 AM
Re: Very bad Performance over native DECnet
Are the 4 switch ports part of the same VLAN (even though they're on different physical switches)? If so then beware the duplicate MAC address problem with Phase IV compatible addressing.
DECnet addressing is an entirely separate issue from IP addressing and IP subnets. Your problem seems to be with the DECnet transport layers (NSP / OSI TRANSPORT) and how DECnet is configured to use the multiple LAN adapters, not with DECnet using IP as a pseudo-transport and how IP is configured to use multiple LAN adapters. You presumably have other protocols too running over the same set of LAN adapters (MOP, SCS, LAT etc.).
"SHOW LAN" (maybe "SHOW LAN /FULL") at the "SDA>" prompt should show you the protocols and addresses in use on each NIC. You should see each protocol type as a separate virtual LAN device (eg: EWA3) where all EWAn: devices use the underlying physical device EWA0:
I think that a diagram and a map of protocols / addresses as configured on each NIC would help you a lot. That should show you the connectivity for each protocol and then you should be able to work out how to configure the different protocols to fail over or load balance across the available physical paths.
If you're still having trouble then maybe you should consider having someone come to help you with it.
Good luck.
Cheers, Colin.
09-28-2006 01:26 AM
Re: Very bad Performance over native DECnet
Yesterday we were able to reboot the problem machine.
After rebooting, we measured again, but could no longer reproduce our problem.
It's a pity that we don't really know where the problem was. For now, I think I'll close this thread, not least because I will be out of the office for the next 3 weeks. I will deal with things like nitrox, underwater photography, sharks and hopefully moonfish too ...
Thank you very much for your competent responses.
Best regards
Heinz