Operating System - OpenVMS

DECnet cluster alias breaks outgoing connections

 
Jeremy Begg
Trusted Contributor

DECnet cluster alias breaks outgoing connections

Hi,

 

This site has a dozen VMSclusters configured more-or-less identically: each has a pair of AlphaServer GS1280s running VMS 8.3 and a pair of BL860c i2 blades running 8.4.  Each cluster runs DECnet-Plus and TCP/IP Services, and each cluster has a cluster alias.

 

The problem concerns the two Integrity blades ("MEBT03" and "MEBT04") in one cluster ("MEBA"). When these machines make an outgoing DECnet connection that has the "Outgoing Alias" attribute set to "True", the connection usually fails.


For example, trying to list a directory on a remote node or open a file typically results in an error like

 

%DIRECT-E-OPENIN, error opening MELT04::*.*;* as input
-RMS-E-FND, ACP file or directory lookup failed
-SYSTEM-F-UNREACHABLE, remote node is not currently reachable

 

after a few minutes.  Occasionally the connection succeeds (still with some delay before results come back), but only because DECnet was able to fall back to the other network stack before the outgoing connection timer expired.

 

Setting the "Outgoing Alias" attribute to False makes the problem go away, but then the remote node associates the connection with the actual local node name rather than the cluster alias name.  This isn't a huge issue; it mainly affects DECnet proxies, and at least it gives us a workaround.

 

As part of the problem investigation I ran SYS$SYSTEM:CDI$TRACE on the remote node and initiated another DIRECTORY command on MEBT04; see the attached file.  It seems to show a number of incoming connection attempts, all of which are correctly associated with the MEBA cluster alias, so my suspicion is that whatever acknowledgement is supposed to be sent back to the originating node isn't being received.  On this occasion the originating node eventually tried connecting over TCP/IP rather than NSP, and that worked straight away.  (The nodes are configured to try NSP first and then TCP/IP, because NSP usually works and saves having to recreate a whole lot of proxy entries.)

 

I'm just wondering where to look next.  The naming cache has been flushed and the DECnet local database appears correct on all nodes.  As far as I can tell, the hardware and DECnet configuration on these two servers is the same as on the other Integrity servers (with the obvious exception of the actual hostnames and addresses).  If someone can point me to a specific configuration issue which would cause this behaviour, I can try harder to find it.

 

Thanks,

Jeremy Begg

 

7 REPLIES
Benjamin Levy
Frequent Advisor

Re: DECnet cluster alias breaks outgoing connections

Hello

 

I think one possible explanation is that the two nodes in question are end nodes, and there aren't any L1 routers on the same LAN segment as those two nodes.

 

MCR NCL SHOW ROUTING TYPE

 

MCR NCL SHO ROUTING CIRC CSMACD-xxx ADJ * LAN ADDR, NEIGHBOR NODE TYPE

You need at least one adjacent Phase V router in order for the cluster alias to work correctly.

 

black_cat
Advisor

Re: DECnet cluster alias breaks outgoing connections

Jeremy,

 

As you may well know, DECnet-Plus End Systems (Phase V) should not need a router to connect to a DECnet cluster alias.

The initial communication (if no entry exists in the End-Node Cache) is done via multicast (ALL-ES).

 

Are these systems in the various clusters multi-circuit (more than one DECnet circuit) end systems?

 

What do you see in the End-Node Cache on the initiating and the receiving system when you initiate a connection?

 

SDA> net show routing cache

 

To see what is really happening, you'll have to use something like Wireshark to get a trace.

Unfortunately, the analysis is a bit cumbersome, as Wireshark only decodes NSP when it is carried over the Phase IV routing protocol, not over ISO 8473 (CLNP) as in this case.

 

John

 

Jeremy Begg
Trusted Contributor

Re: DECnet cluster alias breaks outgoing connections

Hi,

 

All of the Itanium servers (21 of them) are configured identically; only two of them are showing this problem.  Note that the issue concerns initiating an outgoing connection, and it doesn't matter which node or cluster alias they try to connect to.  If I set "Outgoing Alias = False" on the FAL object, the outgoing connection works without any problems, but we'd rather have "Outgoing Alias = True".

 

All of them have two DECnet CSMA-CD circuits, the routing type is Endnode, and there is a DECnet router.

 

For example, on MEBT04 ...

 

MEBT04> mcr ncl
NCL>show session control application fal all char

Node 0 Session Control Application FAL
at 2012-11-07-09:55:12.443+11:00Iinf

Characteristics

    Client                            = <Default value>
    Addresses                         =
       {
          number = 17
       }
    Outgoing Proxy                    = True
    Incoming Proxy                    = True
    Outgoing Alias                    = True
    Incoming Alias                    = True
    Node Synonym                      = True
    Image Name                        = SYS$SYSTEM:FAL.EXE
    User Name                         = <Default value>
    Incoming OSI TSEL                 = <Default value>
    Outgoing Alias Name               = <Default value>
    Network Priority                  = 0

NCL>show routing type

Node 0 Routing
at 2012-11-07-09:55:17.979+11:00Iinf

Characteristics

    Type                              = Endnode

NCL>show routing circuit *

Node 0 Routing Circuit CSMACD-0
at 2012-11-07-09:55:22.371+11:00Iinf

Identifiers

    Name                              = CSMACD-0


Node 0 Routing Circuit CSMACD-1
at 2012-11-07-09:55:22.371+11:00Iinf

Identifiers

    Name                              = CSMACD-1

NCL>show routing circuit csmacd-0 adj * lan addr, neighbor node type

Node 0 Routing Circuit CSMACD-0 Adjacency RTG$0001
at 2012-11-07-09:56:03.970+11:00Iinf

Status

    LAN Address                       = AA-00-04-00-1F-08 (LOCAL:.MELR01)
    Neighbor Node Type                = Phase V Router

NCL>show routing circuit csmacd-1 adj * lan addr, neighbor node type

Node 0 Routing Circuit CSMACD-1 Adjacency RTG$0001
at 2012-11-07-09:56:09.051+11:00Iinf

Status

    LAN Address                       = AA-00-04-00-1F-08 (LOCAL:.MELR01)
    Neighbor Node Type                = Phase V Router

NCL>

 

 

Here's the routing cache (as shown by SDA) on the target system (MELT04) when I issue a DIR MELT04:: command on MEBT04:

 

DECnet-OSI for OpenVMS Routing ES Cache Dump
--------------------------------------------

Routing Prefix DataBase Address B0739EF8
Prefix Table Start: 91A4E4CC , End: 91A4E6CC, Size 0

Routing Cache DataBase Address B0739EE0
Cache Table Start: 91A4FF0C , End: 91A5070C, Size 4
    Cache Entry at Address 91A55714
        NSAP:
                      4900 02AA0004 00160820  .....ª..I       00000000
            NSP Transport - (2.22)
        Cache Circuit Entry Count : 2,  Probe Count: 894
        Cache Circuit List: 91A579C0
        Cache Circuit Entry:
            Type:                 BroadCast
            Format:               PhaseV
            Reachability:         Direct
            Blocksize:            Non-FDDI
            Remaining LifeTime:   002B
            Holding Time:         012C
            Data Link Address:
                               41CF 3DA41700 ..¤=ÏA           00000000
        Cache Circuit Entry:
            Type:                 BroadCast
            Format:               PhaseV
            Reachability:         Direct
            Blocksize:            Non-FDDI
            Remaining LifeTime:   012C
            Holding Time:         012C
            Data Link Address:
                               41CF 3DA41700 ..¤=ÏA           00000000

    Cache Entry at Address 91A5569C
        NSAP:
                      4900 02AA0004 001B0820  .....ª..I       00000000
            NSP Transport - (2.27)
        Cache Circuit Entry Count : 1,  Probe Count: 851
        Cache Circuit List: 91A571E0
        Cache Circuit Entry:
            Type:                 BroadCast
            Format:               PhaseV
            Reachability:         Direct
            Blocksize:            Non-FDDI
            Remaining LifeTime:   012B
            Holding Time:         012C
            Data Link Address:
                               76CF 3DA41700 ..¤=Ïv           00000000

    Cache Entry at Address 91A556EC
        NSAP:
                      4900 02AA0004 00320820  .2...ª..I       00000000
            NSP Transport - (2.50)
        Cache Circuit Entry Count : 1,  Probe Count: 997
        Cache Circuit List: 91A5E570
        Cache Circuit Entry:
            Type:                 BroadCast
            Format:               PhaseV
            Reachability:         Direct
            Blocksize:            Non-FDDI
            Remaining LifeTime:   0121
            Holding Time:         012C
            Data Link Address:
                               18DC 3DA41700 ..¤=Ü.           00000000

    Cache Entry at Address 91A556C4
        NSAP:
                      4900 02AA0004 00360820  .6...ª..I       00000000
            NSP Transport - (2.54)
        Cache Circuit Entry Count : 1,  Probe Count: 0
        Cache Circuit List: 91A5E5A0
        Cache Circuit Entry:
            Type:                 BroadCast
            Format:               PhaseV
            Reachability:         Reverse
            Blocksize:            Non-FDDI
            Remaining LifeTime:   024D
            Holding Time:         0258
            Data Link Address:
                               0836 000400AA ª...6.           00000000
                (2.54)

The initiating node (MEBT04) has DECnet address 2.54 and its cluster alias (MEBA) is 2.50, both of which appear in the cache dump above.
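For anyone following along, the NSAP strings in the dump embed the Phase IV address.  A small Python sketch of the decoding (my own illustration, assuming the layout visible above: AFI 49, a two-byte area prefix, a six-byte ID whose last two bytes are the 16-bit Phase IV address in little-endian order, and a one-byte selector):

```python
def phase_iv_from_nsap(nsap_hex: str) -> str:
    """Decode a Phase IV-compatible NSAP (AFI 49) to an area.node address.

    Assumes the layout seen in the SDA dump: AFI 49, 2-byte area,
    6-byte ID ending in the 16-bit Phase IV address (little-endian),
    then a 1-byte selector.  A sketch, not a general NSAP parser.
    """
    b = bytes.fromhex(nsap_hex.replace(" ", ""))
    assert b[0] == 0x49, "expected the local (Phase IV-compatible) AFI"
    addr = b[7] | (b[8] << 8)          # last two ID bytes, low byte first
    return f"{addr >> 10}.{addr & 0x3FF}"   # 6-bit area, 10-bit node

# The alias (2.50) and node (2.54) entries from the MELT04 cache dump:
print(phase_iv_from_nsap("4900 02AA0004 00320820"))  # 2.50
print(phase_iv_from_nsap("4900 02AA0004 00360820"))  # 2.54
```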

 

And here's the cache on MEBT04 at the same time:

 

DECnet-OSI for OpenVMS Routing ES Cache Dump
--------------------------------------------

Routing Prefix DataBase Address AE739EF8
Prefix Table Start: 91A6CCCC , End: 91A6CECC, Size 0

Routing Cache DataBase Address AE739EE0
Cache Table Start: 91A6E70C , End: 91A6EF0C, Size 3
    Cache Entry at Address 91A73EC4
        NSAP:
                      4900 02AA0004 00030820  .....ª..I       00000000
            NSP Transport - (2.3)
        Cache Circuit Entry Count : 1,  Probe Count: 995
        Cache Circuit List: 91A75D50
        Cache Circuit Entry:
            Type:                 BroadCast
            Format:               PhaseIV
            Reachability:         Direct
            Blocksize:            Non-FDDI
            Remaining LifeTime:   00A5
            Holding Time:         0258
            Data Link Address:
                               0803 000400AA ª.....           00000000
                (2.3)

    Cache Entry at Address 91A73F14
        NSAP:
                      4900 02AA0004 00150820  .....ª..I       00000000
            NSP Transport - (2.21)
        Cache Circuit Entry Count : 1,  Probe Count: 144
        Cache Circuit List: 91A7D1A0
        Cache Circuit Entry:
            Type:                 BroadCast
            Format:               PhaseV
            Reachability:         Direct
            Blocksize:            Non-FDDI
            Remaining LifeTime:   0127
            Holding Time:         012C
            Data Link Address:
                               14DC 3DA41700 ..¤=Ü.           00000000

    Cache Entry at Address 91A73EEC
        NSAP:
                      4900 02AA0004 001D0820  .....ª..I       00000000
            NSP Transport - (2.29)
        Cache Circuit Entry Count : 1,  Probe Count: 999
        Cache Circuit List: 91A75D80
        Cache Circuit Entry:
            Type:                 BroadCast
            Format:               PhaseV
            Reachability:         Direct
            Blocksize:            Non-FDDI
            Remaining LifeTime:   011D
            Holding Time:         012C
            Data Link Address:
                               367C 77A41700 ..¤w|6           00000000

 

MELT04 has address 2.29 and that's the last entry in the cache listing.

 

All of these nodes have OPENVMS-I64-MCOE licences so maybe I could try changing the node routing type to L1 router, but I don't see why it should be necessary on two nodes out of 21.

 

Thanks,

Jeremy Begg

 

black_cat
Advisor

Re: DECnet cluster alias breaks outgoing connections

Jeremy,

I just tested your configuration(?) in a similar environment and it worked for me:

V8.3 OpenVMS Alpha Multicircuit ES => V8.4 OpenVMS IA64 Multicircuit ES in a router (OpenVMS host based) environment.
These are the DECnet versions I was using:
 
    Implementation                    =
       {
          [
          Name = OpenVMS I64 ,
          Version = "V8.4    "
          ] ,
          [
          Name = HP DECnet-Plus for OpenVMS ,
          Version = "V8.4 ECO02 14-JUN-2012 16:46:39.40"
          ]
       }
    Implementation                    =
       {
          [
          Name = OpenVMS Alpha ,
          Version = "V8.3    "
          ] ,
          [
          Name = HP DECnet-Plus for OpenVMS ,
          Version = "V8.3 ECO03 25-NOV-2008 15:49:46.19"
          ]
       }

What surprised me from your output was that you only had a 'Cache Circuit Entry Count' of 1, even on your MEBT04 system,
considering you have 2 active DECnet circuits over which you can see the Router adjacencies.
I assume you are running a meshed and not a dual-railed LAN?
Are you using a host-based router?

Would it be possible to collect some more info, possibly in the form of an attachment?

On MEBT04, MELT04 and MELR01
$ mc lancp sho config
$ mc ncl show implementation
$ mc ncl sho csma-cd stat * all stat
$ mc ncl sho routing all char
$ mc ncl sho routing circ * all

On MEBT04, MELT04
$ mc ncl sho address
$ mc ncl sho alias port * all
$ mc ncl sho session control all

On MELR01 (router)
$! Address 2.50
$ mc ncl sho rou dest node AA-00-04-00-32-08 all
$! Address 2.54
$ mc ncl sho rou dest node AA-00-04-00-36-08 all
$! Address 2.29
$ mc ncl sho rou dest node AA-00-04-00-1D-08 all
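(The AA-00-04-00-xx-yy values in those commands are just the standard Phase IV address-to-MAC mapping: the fixed HIORD prefix AA-00-04-00 followed by the 16-bit value area*1024+node, low byte first.  A quick Python sketch of the mapping, purely for illustration:)

```python
def decnet_mac(area: int, node: int) -> str:
    """Map a DECnet Phase IV address (area.node) to its LAN MAC address.

    Phase IV packs the address as area*1024 + node and appends the two
    bytes, low byte first, to the fixed HIORD prefix AA-00-04-00.
    """
    addr = (area << 10) | node         # 6-bit area, 10-bit node
    return "AA-00-04-00-%02X-%02X" % (addr & 0xFF, addr >> 8)

# The three addresses of interest here:
print(decnet_mac(2, 50))   # AA-00-04-00-32-08  (cluster alias MEBA)
print(decnet_mac(2, 54))   # AA-00-04-00-36-08  (MEBT04)
print(decnet_mac(2, 29))   # AA-00-04-00-1D-08  (MELT04)
```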

John

PS. I may not respond for the next few days as I'm attending the local TUD.

Jeremy Begg
Trusted Contributor

Re: DECnet cluster alias breaks outgoing connections

Hi John,

 

Thanks for helping out.  I have attached the output from those commands for MEBT04 and MELT04.

MELR01 is a router box somewhere (I've never seen it); it's not a VMS system and I don't have access to it.

 

I should explain the LANCP configuration.  These are blade servers configured using Virtual Connect and each blade gets assigned 16 virtual ethernet interfaces by the hardware, even though there are only two physical ethernet ports on the chassis.  We have configured 6 of the 16 ethernet ports for use by VMS:

 

EWA0 & EWB0 are reserved for SCA (cluster traffic)

EWD0 & EWL0 are dedicated to a TCP/IP subnet used for backing up filesystems across the network

EWI0 & EWJ0 are general purpose used for DECnet and TCP/IP.

 

I have raised a support case with HP.  If you want to continue to assist with my problem, feel free to do so.

I shall share the results either way.


Regards,

Jeremy Begg

Jeremy Begg
Trusted Contributor

Re: DECnet cluster alias breaks outgoing connections

Well, I have an update of sorts.

 

HP logged in and had a look and found a number of "issues" with our DECnet configuration.  But I'm not convinced that they explain why we're seeing this problem on only one of our many VMSclusters.

 

One thing I've found which might be significant: the router for the test/dev systems has a DECnet-IV address which is higher than all the test/dev node addresses EXCEPT for the nodes in the "problem" cluster.

 

On our production systems, the two production routers' DECnet-IV addresses are higher than all the production nodes.

 

Is it possible that our problem could be due to the DECnet address, and if so, why?

 

Thanks,

Jeremy Begg

black_cat
Advisor

Re: DECnet cluster alias breaks outgoing connections

Jeremy,

 

From the output that you made available I found nothing untoward, except in combination with the following statement you made:


"MELR01 is a router box somewhere (I've never seen it), it's not a VMS system and I don't have access to it."

 

Can you access it via NCL ?

NCL> SHOW NODE 2.31 all

 

If it is not an old DIGITAL router (e.g. a DECNIS) or a host-based router, then I would expect your NCL ROUTING characteristic attribute 'Routing Mode' to be set to SEGREGATED rather than INTEGRATED as in your case, as most non-DEC router implementations were based on SIN (Ships In the Night).  But I imagine this setting is also present on all the other cluster systems where you made your connectivity tests(?).

 

What do you mean when you say:

"the router for the test/dev systems has a DECnet-IV address which is higher than all the test/dev node addresses EXCEPT for the nodes in the "problem" cluster."

 

If all your systems are DECnet-Plus End Systems, you will only need a router if you are routing between areas.  As I have said before, unlike DECnet Phase IV systems, you will not need a router to serve a cluster alias.

 

Once this is hopefully resolved you can join me in persuading Engineering to move the tcpdump functionality from the TCP stack down to the LAN driver level (as was the case with Tru64), so that we can resolve these problems a lot quicker than is now possible with utilities such as CTF.

 

John