Re: Cluster Interconnect problem

Song_Charles · ‎10-28-2009

Hi,
My customer just upgraded their application environment, including an upgrade of OpenVMS to V8.3 with ECOs. One EVA8100 and five ES40s form a cluster with a single system disk. The cluster interconnect is the NI (LAN) running in FastFD mode (DE602, one dual-port card in each ES40).
After the cluster had been running continuously for about 10 days, 3 nodes went down today and quorum was lost.
While I was on site, I found many errors on the PEA0 channels under SCACP, which I don't think is a normal state.
My questions are:
1. What should I do to improve the cluster's reliability?

2. Can I use dedicated LAN segments for cluster communication?

Thanks,
Charles Song

Working and enjoying life

Volker Halle · ‎10-28-2009

Charles,

the first question to be answered should be: why were the 3 nodes down ? What happened ? Crashed and unable to reboot ? Hung ?

Exactly which counters did you look at with SCACP ? Some 'errors' may just be normal.

Are both ports of the DE602 LAN card connected to the same or different LAN(s), and are both being used for cluster communications ?

Do the switch ports and the LAN devices agree on speed and duplex settings ?

Volker.

The Brit · ‎10-28-2009

Charles,
I would also check your OPERATOR.LOGs, on all nodes, for any occurrences of "CNXMAN" and "PEA", since you booted.

The logs are at SYS$MANAGER:OPERATOR.LOG (note this is node-specific)
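If the OPERATOR.LOG files are copied off the nodes for review, the search suggested above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample lines are made up, and the filter is a plain case-insensitive substring match, not a parser for the real OPCOM message format.

```python
from typing import Iterable, List, Tuple

# Keywords suggested in the thread; anything else is left out.
KEYWORDS = ("CNXMAN", "PEA")


def find_cluster_events(
    lines: Iterable[str], keywords: Tuple[str, ...] = KEYWORDS
) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line mentioning a keyword.

    The match is a coarse, case-insensitive substring test, so a word
    such as "repeat" would also match "PEA"; review the hits by hand.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        upper = line.upper()
        if any(keyword in upper for keyword in keywords):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Placeholder lines, NOT real OPERATOR.LOG content.
    sample = [
        "placeholder line mentioning CNXMAN\n",
        "unrelated placeholder line\n",
        "placeholder line mentioning PEA0\n",
    ]
    for number, text in find_cluster_events(sample):
        print(f"{number}: {text}")
```

Running it once per node's log and comparing timestamps of the hits across nodes would show whether the nodes lost contact with each other at the same moment.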

We also need more information about the infrastructure, i.e. (as suggested by Volker) how your NIC ports connect physically to the network switch(es), whether the paths between nodes are redundant, etc.

Dave

Song_Charles · ‎10-28-2009

Hi,
The customer told me that their Oracle server could not be accessed from the client side, so they reset all 5 systems and rebooted. The cluster re-formed and was back to normal.

One port (EIA) on each ES40 is connected to the public network and the other (EIB) is connected to a private network. The EIB port is not configured with any network protocol; it is just cabled to the switch.

Under SRM, I set EIA0_MODE and EIB0_MODE to FastFD, and I also set this in the LANCP utility.

The SCACP counters are in the attachment.

Since this is a remote customer, I have asked the customer to help me get the OPERATOR.LOGs from each ES40.

Thanks
Charles Song

Working and enjoying life

Volker Halle · ‎10-28-2009

Charles,

the XMT:Tmo ratio does not look very healthy. It is in the 400-600 range, which indicates that a retransmission due to a transmit timeout is happening every 400-600 packets. I normally see values in the range of 20000 and much higher.
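To make the reading of that ratio concrete, here is a minimal Python sketch. It assumes the ratio is simply transmitted packets divided by transmit timeouts, which matches the "every 400-600 packets" interpretation above but is not taken from SCACP documentation. The function names and the example counts are hypothetical, and the 20000 threshold is just the value quoted above as typical, not an official limit.

```python
def xmt_per_timeout(packets_sent: int, transmit_timeouts: int) -> float:
    """Packets transmitted per transmit timeout (higher is better)."""
    if transmit_timeouts == 0:
        # No timeouts at all: treat the ratio as unbounded.
        return float("inf")
    return packets_sent / transmit_timeouts


def looks_healthy(ratio: float, threshold: float = 20000.0) -> bool:
    """Compare against the rule-of-thumb value mentioned in the thread."""
    return ratio >= threshold


if __name__ == "__main__":
    # Hypothetical counter values, not taken from the attachment.
    ratio = xmt_per_timeout(packets_sent=1_000_000, transmit_timeouts=2_000)
    print(ratio, looks_healthy(ratio))
```

With these made-up counts the ratio is 500, i.e. one timeout-driven retransmission per 500 packets, which falls in the unhealthy range described above.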

WSR1 EIB might have a duplex mismatch problem. Check whether the switch port agrees with 100 Mbit FDX and is not set to auto-negotiation.

The 'problem description' your customer has given is not really helpful. Please educate your customer to force a system crash instead of just 'resetting' the systems if something appears hung. Some PING tests could probably have helped to determine which system, if any, was hung. You might need the console output to determine what was wrong.

Volker.

Volker Halle · ‎10-28-2009

Charles,

if you look at the Channel Rexmit Errors, you see that they are much higher on the EIB LAN adapter from WSR1 to all other 4 nodes, than on the EIA LAN adapter, so there seems to be some problem on that LAN (most likely speed/duplex/auto-negotiation settings). You need to check this data on all nodes.

I don't believe that whatever problem happened was caused by 'cluster interconnect problems'.

Volker.

Song_Charles · ‎10-28-2009

Volker,
I will tell the customer to force a crash the next time the cluster stops working.
I checked and set EIx0_MODE carefully when I installed the cluster, but I could not confirm the values on the switch side.
Why is the XMT:Tmo ratio so low, and how can I adjust it?
Another question: can I dedicate one NIC port to the cluster connection? I would like to set that port to twisted-pair 10 Mb mode; would this mode be more reliable?

Thanks.
Charles Song

Working and enjoying life

Volker Halle · ‎10-28-2009

Charles,

you need to find out the switch port settings. Otherwise, you could set the OpenVMS side to auto-negotiation and hope that the switch is also set to auto. You can try this dynamically with LANCP first (i.e. LANCP> SET DEV EIB/AUTO) and check whether the duplex mode mismatch errors disappear. With LANCP SHOW DEV/INT EIB you are able to see the LAN driver error messages, which are otherwise only visible on the console terminal.

Setting the EIx LAN interface to 10 Mbit HDX may only make things worse.

You cannot improve the XMT:Tmo ratio except by making sure that the underlying network works reliably, especially the EIB LAN, as indicated by the higher number of Channel Rexmit Errors.

Don't blame the cluster protocol for what has happened, until after you understand what has really happened ! The cluster communication protocol is very reliable and if there is a working channel between the systems, it will find and use it !

Volker.

Song_Charles · ‎10-28-2009

Volker,
Good idea. I will change the values in LANCP and check the result on just one ES40 first. In my case, can both ports be used as cluster communication channels?

Thanks
Charles Song

Working and enjoying life

Volker Halle · ‎10-28-2009

Charles,

as MC SCACP SHOW CHAN shows, you have 2 channels between each pair of ES40s. One channel is formed between the EIA LAN interfaces of both nodes and the other channel between the EIB LAN interfaces. The two network segments are NOT connected to each other.

As long as one of those 2 channels is functioning, the virtual circuit between the nodes is intact and cluster messages can be exchanged.
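The relationship between channels and the virtual circuit can be pictured with a tiny Python model. It is a conceptual sketch only: the `Channel` class and `virtual_circuit_open` function are illustrative names, not part of any OpenVMS interface, and the real PEDRIVER logic is considerably more involved.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Channel:
    """One LAN path between a local and a remote adapter (illustrative)."""
    local_device: str
    remote_device: str
    working: bool


def virtual_circuit_open(channels: Iterable[Channel]) -> bool:
    """The circuit to a remote node stays up while any channel works."""
    return any(channel.working for channel in channels)


if __name__ == "__main__":
    # Two channels to one remote node, as in this cluster: EIA-EIA and EIB-EIB.
    channels = [
        Channel("EIA", "EIA", working=True),
        Channel("EIB", "EIB", working=False),  # e.g. a duplex mismatch
    ]
    print(virtual_circuit_open(channels))
```

In this model a faulty EIB segment degrades redundancy but does not by itself break cluster communication, which is why the EIB errors alone are unlikely to explain three nodes going down.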

Volker.
