HUB900

 
Wim Van den Wyngaert
Honored Contributor

We have an FDDI based cluster with a dual intersite connection and 2 hub900 in each site.
This weekend a reconfiguration is going to be done : 1 of the 2 sets of fibers is going to be replaced by new ones and a switch from multimode to single mode will be done on the hub900 (only for the intersite connection, the nodes stay connected in multimode).

According to HP this can be done without risk for the VMS nodes.

Has anyone had the same kind of intervention and run into problems? The cluster is running Swift software. If not, other horror stories may be posted too.
Wim
9 REPLIES
Jan van den Ende
Honored Contributor

Re: HUB900

Wim,
maybe you can sleep a little easier after this:

Hardly a horror story.

We BUILT our cluster at a single location, multi-mode FDDI. Then we shut down a node, transferred it to the remote site, and brought it up via single mode. Actually, after that we brought down two nodes-in-a-single-cabinet, moved them as well, and then removed the first-to-move from the cluster, because that WAS our spare/test/extra/sometimes-tru64 system, which we brought in just to have enough members while moving the two-node cabinet.
(All this was done in 1998, and the cluster is still running)

Well, we DID also have an ethernet connection between the two sites, so that COULD maintain the connection at lower performance just-in-case, but we did NOT notice any fallback occurring.

Maybe you can sleep more relaxed now (although I remember that all of us also were a little bit jumpy before the event).

You will of course report after the event I hope?

Success!!!


Jan
Don't rust yours pelled jacker to fine doll missed aches.
Wim Van den Wyngaert
Honored Contributor

Re: HUB900

Jan,

I'm asking because recently:

1) replacing hot swappable batteries resulted in a dismount of a shadow member (device found offline while it wasn't).

2) fiber quality was "unstable" and this resulted in "excessive packet loss" and finally a crash (LOCKMGRERR), even though there is failover (a second FDDI). All cluster nodes did a CLUEXIT (RECNXINTERVAL too low).

So, while doing the intervention on the hub900, bad fibre quality or other things could make things unstable.
Wim
Keith Parris
Trusted Contributor

Re: HUB900

Are the two hub900s (MultiSwitch 900s from Digital Network Products Group, I assume) in each site bridged together, or independent? Are there 2 LAN adapters in each VMS node, one connected to each of the two hub900s at each site?

The reason I ask is that if the 2 hub900s are bridged together, then if both inter-site links are working, the Spanning Tree protocol may turn off one of the two inter-site links at any given time (thus wasting inter-site bandwidth you may need for things like shadow copies or merges). And in terms of failure detection, if one of the inter-site links in such a bridged configuration breaks, the bridges will use the other link, but you may not notice that it is broken, until the 2nd one also fails, because it's transparent to VMS.

When there are multiple inter-site links, I recommend that for cluster communications (SCS) they be kept separate, with separate LAN adapters in each node connecting to each of the sets of independent bridges, and let PEDRIVER select the appropriate path, or use multiple paths at once for greater throughput.

I also recommend that you set up LAVC$FAILURE_ANALYSIS if you haven't already, to allow you to track failures in the LAN configuration (see the EDIT_LAVC.COM tool from the [KP_CLUSTERTOOLS] directory of the V6 Freeware CD for a way to automate setup of this tool). This way, OPCOM messages will be generated identifying failed (and repaired) LAN components.

I would recommend temporarily raising the RECNXINTERVAL parameter on all nodes during the cutover and testing (it's a dynamic parameter) so the cluster will be more tolerant of temporary disruptions and if you break something during the cutover, hopefully you will have enough time to put it back or correct it before the cluster is disrupted.

I'd also monitor ReXMT (and ReRCV) rates under SDA or SCACP both before and after the conversion, to detect any increase in lost or corrupted (and thus re-transmitted) packets, or any lost ACKs (causing packets to be re-sent unnecessarily and thus re-received again).
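The before/after comparison can be made concrete with a small calculation. The following is only a sketch: it assumes you have noted the transmitted and retransmitted packet counts over two comparable intervals, and the function names, the sample numbers and the alert factor of 2 are all hypothetical choices, not anything SDA or SCACP produces directly.

```python
def retransmit_rate(sent, retransmitted):
    """Fraction of transmitted packets that were retransmissions."""
    if sent <= 0:
        return 0.0
    return retransmitted / sent


def rate_increased(before, after, factor=2.0):
    """Return True if the retransmit rate after the change is more than
    `factor` times the baseline rate, or if retransmits appear where the
    baseline had none. `before` and `after` are (sent, retransmitted) pairs.
    The factor is an arbitrary, assumed threshold."""
    base = retransmit_rate(*before)
    new = retransmit_rate(*after)
    if base == 0.0:
        return new > 0.0
    return new > factor * base


# Hypothetical counter deltas noted over equal intervals.
baseline = (1_000_000, 50)       # before the cutover
post_cutover = (1_000_000, 400)  # after the cutover
print(rate_increased(baseline, post_cutover))
```

Comparing rates rather than raw counts keeps the check meaningful when the two intervals carried different amounts of traffic.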
Wim Van den Wyngaert
Honored Contributor

Re: HUB900

Keith,

They are interconnected but not bridged.
We have a link between 1 LAN adapter (DEFPA) and 2 hub ports, each in a different hub900.

I will check the analysis tool later.

Wim
Jan van den Ende
Honored Contributor

Re: HUB900

Well, then.

Wim, I guess now I understand your concerns a bit better. I wonder what is not-up-to-standard with your FDDI.
The way I always understood things, FDDI uses essentially a token-passing-ring mechanism, which implies dual fibers via a double path between sites. If one path is broken, your token-and-message train uses the "upstream" and "downstream" fiber of the now-only-single connection. TWO breaks WILL disrupt connectivity. So if you are going to do any activity on some component of your ring, you should check beforehand that BOTH paths are operational; for the duration of those activities you will NOT be redundant.
So: do I not have a correct view of your configuration, or is something not redundant where it should, or has some failed redundancy gone unnoticed?
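The "one break is survivable, two are not" reasoning can be illustrated with a toy model. This is a deliberately simplified sketch: it assumes every cable segment carries both rings, that stations on either side of a break wrap, and it ignores concentrators and single-attached stations. The function name and station numbering are made up for the illustration.

```python
def ring_partitions(n_stations, broken_segments):
    """Return the groups of stations that can still reach each other.

    Stations are numbered 0..n-1. Segment i joins station i and
    station (i + 1) % n and is assumed to carry both fibers of the
    dual ring, so a break there makes its two neighbours wrap.
    """
    broken = set(broken_segments)
    if not broken:
        return [list(range(n_stations))]
    groups = []
    # Start just after a broken segment so each group is a contiguous run.
    start = (min(broken) + 1) % n_stations
    current = [start]
    for step in range(1, n_stations):
        station = (start + step) % n_stations
        prev_segment = (station - 1) % n_stations  # joins previous station to this one
        if prev_segment in broken:
            groups.append(current)
            current = [station]
        else:
            current.append(station)
    groups.append(current)
    return groups


print(ring_partitions(4, []))      # intact ring: one group
print(ring_partitions(4, [1]))     # one break: wrapped, still one group
print(ring_partitions(4, [1, 3]))  # two breaks: two isolated groups
```

With stations 0 and 1 standing for one site and 2 and 3 for the other, the last case corresponds to losing both inter-site paths at once.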

Keith:
Am I missing something? I was under the impression that Spanning Tree had to do with Ethernet type of configurations, which MAY NOT have active rings, and the Spanning Tree protocol is used to guard the 'gaps' between parts of the network that CAN be connected, and will GET connected if the 'left' and 'right' side of the gap are no longer connected elsewhere in the network.
A RING like FDDI on the other hand SHOULD normally consist of a ring.

Correct me if I am wrong.

OTOH, if your interconnect IS (any form of) ethernet, and Spanning Tree is used, then, indeed, RECNXINTERVAL should be longer than the Spanning Tree failover duration.
I was (remotely) involved in a cluster that used Giga-ethernet (with 100 Mb as redundancy) for cluster-interconnect, and that behaved like you described: the Spanning Tree failover time was about 40 - 50 seconds, and the cluster only survived those failovers after RECNXINTERVAL was increased (on all nodes) to 60 seconds.
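That rule of thumb amounts to a simple comparison, sketched here. It assumes the only thing that matters is whether the interruption ends before RECNXINTERVAL expires, which is a simplification of real cluster behaviour; the function name and the 10-second safety margin are hypothetical.

```python
def survives_outage(recnxinterval_s, outage_s, margin_s=10):
    """Return True if RECNXINTERVAL covers the expected interruption
    plus a safety margin (the margin is an assumed, arbitrary value)."""
    return recnxinterval_s >= outage_s + margin_s


worst_case_failover_s = 50  # upper end of the 40 - 50 second failover above

print(survives_outage(60, worst_case_failover_s))   # the value that worked
print(survives_outage(20, worst_case_failover_s))   # a hypothetical lower setting
print(survives_outage(120, worst_case_failover_s))  # a more generous setting
```

The same check applies to any planned interruption, such as swapping a fiber: estimate the longest time the path could be down and compare it with the current setting before starting.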
Don't rust yours pelled jacker to fine doll missed aches.
Wim Van den Wyngaert
Honored Contributor

Re: HUB900

Jan,

1 of the 2 interbuilding fibers is already down because it is unstable. Now it is going to be replaced (now could be later because now we have bad fibers).
Our recnxinterval is already on 120 seconds.

Above all, I am afraid that we get the green light and that FDDI becomes unstable or so afterwards.
And there is no test environment. Only production.

Wim
Jan van den Ende
Honored Contributor

Re: HUB900

Wim,

do you have/can you enable an Ethernet connection next to your FDDI?
That could also provide redundancy.
Then again, if that uses the same fibre bundle, and your entire bundle is compromised, then I am afraid you will be forced to perform some high-up circus stunts without a safety net. But probably, the longer you wait, the more some stuff will deteriorate, and the longer your production cluster will live in the danger zone.

If you know WHICH parts are unreliable, of course you should replace those first, but..

I'm afraid I'm now down to wishing you all kinds of good luck. You might need it.

I think my "just go and sleep easy" was a bit premature..

Hang in there,

and success!!

(you will report the outcome, I trust?)

Jan
Don't rust yours pelled jacker to fine doll missed aches.
Keith Parris
Trusted Contributor

Re: HUB900

You can ignore the first two paragraphs of my earlier reply. The rest should still apply.

Jan,

The "hub900" nomenclature threw me off course. I was thinking MultiSwitch 900, but he probably actually has DEChub 900s, which are a 6-port FDDI concentrator.

So of course bridging and Spanning Tree are not involved, just an FDDI ring.

Sorry for my confusion.
Wim Van den Wyngaert
Honored Contributor

Re: HUB900

Very late report.

Don't let them touch the connectors of the working fibers. An unplug/plug may break the connector. So, no fast switch to test something.

The MX900 may not survive an unplug to change ports.

The intervention failed but there were no serious problems.

Wim