Operating System - OpenVMS

Cluster Interconnect problem

 
Song, Charles
Frequent Advisor


Hi,
My customer just upgraded their application environment, including an OpenVMS upgrade to V8.3 with ECOs. One EVA8100 and five ES40s form a cluster with a single system disk. The cluster interconnect is the NI running at FastFD (Fast Ethernet, full duplex) on DE602 cards (one card with two ports in each ES40).
After the cluster had run continuously for about 10 days, three nodes went down today and quorum was lost.
While I was on site, I found many errors on the PEA0 channels under SCACP, which I don't think is normal.
My questions are:
1. What should I do to improve the cluster's reliability?

2. Can dedicated LAN segments be used for cluster communication?

Thanks,
Charles Song
Working and enjoying life
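
A note on the quorum loss described above. OpenVMS derives quorum from the EXPECTED_VOTES system parameter as (EXPECTED_VOTES + 2) / 2, rounded down. The sketch below is only an illustration of that arithmetic. It assumes each of the five ES40s contributes one vote and that no quorum disk is configured; the thread states neither, so the vote counts are hypothetical.

```python
def quorum(expected_votes: int) -> int:
    """Quorum as OpenVMS computes it: (EXPECTED_VOTES + 2) / 2, truncated."""
    return (expected_votes + 2) // 2


def has_quorum(votes_present: int, expected_votes: int) -> bool:
    """True if the votes still present are enough to keep the cluster running."""
    return votes_present >= quorum(expected_votes)


if __name__ == "__main__":
    # Assumption: 5 nodes with 1 vote each, no quorum disk (not stated in the thread).
    expected = 5
    for nodes_up in range(expected, 0, -1):
        print(nodes_up, "nodes up ->", "quorum held" if has_quorum(nodes_up, expected) else "quorum lost")
```

Under those assumptions quorum is 3, so a cluster with only two of five nodes remaining would stall until votes are restored.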
11 REPLIES
Volker Halle
Honored Contributor

Re: Cluster Interconnect problem

Charles,

the first question to be answered should be: why were the 3 nodes down? What happened? Did they crash and fail to reboot? Did they hang?

Exactly which counters did you look at with SCACP? Some 'errors' may just be normal.

Are both ports of the DE602 LAN card connected to the same or to different LAN(s), and are both being used for cluster communications?

Do the switch ports and the LAN devices agree on speed and duplex settings?

Volker.
The Brit
Honored Contributor

Re: Cluster Interconnect problem

Charles,
I would also check your OPERATOR.LOGs, on all nodes, for any occurrences of "CNXMAN" and "PEA", since you booted.

The logs are at SYS$MANAGER:OPERATOR.LOG (note this is node-specific)

We also need more information about the infrastructure, i.e. (as suggested by Volker) how your NIC ports connect physically to the network switch(es), whether the paths between nodes are redundant, etc.

Dave
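
If the OPERATOR.LOG files are copied off the nodes as plain text, the search suggested above could be sketched roughly as follows. This is a minimal example, not an OpenVMS tool: the function name is hypothetical, and it simply reports every line that contains "CNXMAN" or "PEA", leaving interpretation of the messages to the reader.

```python
from typing import Iterable, List, Tuple

# Strings to look for, as suggested in the reply above.
KEYWORDS = ("CNXMAN", "PEA")


def find_cluster_messages(
    lines: Iterable[str], keywords: Tuple[str, ...] = KEYWORDS
) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line containing one of the keywords."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(keyword in line for keyword in keywords):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path: str) -> List[Tuple[int, str]]:
    """Scan one copied log file; undecodable bytes are replaced rather than fatal."""
    with open(path, "r", errors="replace") as handle:
        return find_cluster_messages(handle)
```

Running scan_file once per node's log and comparing the timestamps of the hits across nodes is one way to see which node noticed a problem first.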
Song, Charles
Frequent Advisor

Re: Cluster Interconnect problem

Hi,
The customer told me that their Oracle server could not be reached from the client side, so they reset all 5 systems and rebooted. The cluster re-formed and was normal afterwards.

One port (EIA) on each ES40 is connected to the public network and the other (EIB) to the private network. The EIB port is not configured with any network protocol; it is just cabled to the switch.

Under SRM, I set EIA0_MODE and EIB0_MODE to FastFD, and I made the same setting with the LANCP utility.

The SCACP counters are in the attachment.

Because this is a remote customer, I have asked them to help me get the OPERATOR.LOGs from each ES40.

Thanks,
Charles Song
Working and enjoying life
Volker Halle
Honored Contributor

Re: Cluster Interconnect problem

Charles,

the XMT:Tmo ratio does not look very healthy. It is in the 400-600 range, which indicates that a retransmission due to a transmit timeout is happening every 400-600 packets. I normally see values of 20000 and much higher.

EIB on WSR1 might have a duplex mismatch problem. Check whether the switch port agrees with 100 Mbit FDX and is not set to auto-negotiation.

The 'problem description' your customer has given is not really helpful. Please educate your customer to force a system crash instead of just 'resetting' the systems if something appears hung. Some PING tests would probably have helped to determine which system, if any, was hung. You might need the console output to determine what was wrong.

Volker.
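
To make the ratio discussed above concrete: it can be read as packets transmitted per retransmission timeout, so a small value means frequent retransmissions. The sketch below is an assumption-laden illustration, not SCACP's own calculation. The function names and counter values are hypothetical, and the 20000 threshold is simply the figure quoted in the reply as a typical healthy value.

```python
from typing import Optional


def xmt_tmo_ratio(packets_sent: int, rexmit_timeouts: int) -> Optional[float]:
    """Packets sent per retransmission timeout; None if no timeout was counted."""
    if rexmit_timeouts == 0:
        return None
    return packets_sent / rexmit_timeouts


def looks_unhealthy(ratio: Optional[float], typical_minimum: float = 20_000) -> bool:
    """Flag ratios below the value the reply describes as normally seen."""
    return ratio is not None and ratio < typical_minimum


if __name__ == "__main__":
    # Hypothetical counters, chosen only to land in the 400-600 range mentioned above.
    ratio = xmt_tmo_ratio(packets_sent=1_000_000, rexmit_timeouts=2_000)
    print(ratio, looks_unhealthy(ratio))
```

A ratio near 500 means roughly one packet in 500 had to be retransmitted after a timeout, far more often than the one in 20000 or better that the reply treats as usual.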
Volker Halle
Honored Contributor

Re: Cluster Interconnect problem

Charles,

if you look at the Channel Rexmit Errors, you see that they are much higher on the EIB LAN adapter from WSR1 to all other 4 nodes than on the EIA LAN adapter, so there seems to be some problem on that LAN (most likely speed/duplex/auto-negotiation settings). You need to check this data on all nodes.

I don't believe that whatever problem happened was caused by 'cluster interconnect problems'.

Volker.
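
The per-adapter comparison described above can be sketched as a simple aggregation: sum the channel retransmit errors by local LAN device and see which device stands out. This is only an illustration. The tuple layout, function names, remote node names and counter values are all hypothetical; real figures would have to be taken from the SCACP channel display on each node.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# (local LAN device, remote node, channel rexmit errors); layout is an assumption.
Channel = Tuple[str, str, int]


def rexmit_by_device(channels: Iterable[Channel]) -> Dict[str, int]:
    """Total retransmit errors per local LAN device across all remote nodes."""
    totals: Dict[str, int] = defaultdict(int)
    for local_device, _remote_node, rexmit_errors in channels:
        totals[local_device] += rexmit_errors
    return dict(totals)


def worst_device(channels: Iterable[Channel]) -> Optional[str]:
    """Local device with the most retransmit errors, or None if there is no data."""
    totals = rexmit_by_device(channels)
    return max(totals, key=totals.get) if totals else None


if __name__ == "__main__":
    # Hypothetical counters as seen from one node; NODE2..NODE5 are placeholder names.
    seen_from_one_node = [
        ("EIA", "NODE2", 12), ("EIA", "NODE3", 9), ("EIA", "NODE4", 15), ("EIA", "NODE5", 11),
        ("EIB", "NODE2", 840), ("EIB", "NODE3", 910), ("EIB", "NODE4", 770), ("EIB", "NODE5", 865),
    ]
    print(rexmit_by_device(seen_from_one_node), worst_device(seen_from_one_node))
```

Repeating the same tally with the counters from every node shows whether one adapter, one node, or a whole LAN segment is responsible for the retransmissions.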