Operating System - HP-UX

Serviceguard failed to switch over to local standby interface

Ra Jose
Regular Advisor

Hello all,

We have a production Oracle database cluster
running on 2 nodes. Each node is an rx8640
running HP-UX 11i v2 with Serviceguard (SG) 11.17.
One node runs the Oracle database package; the
other is the standby.

Each node has 2 quad-port NIC cards.
On one card, all 4 ports are aggregated into
the lan900 interface, which is the primary LAN.
On the other card, all 4 ports are aggregated into
the lan901 interface, which is the standby LAN.

When this cluster went into production about
a year ago, we tested failover/failback for
these NIC interfaces. We were able to fail
over to lan901 when lan900 was pulled,
and vice versa.

lan900 is connected to the primary network switch,
and lan901 to the secondary switch.

Today we had issues on the primary network switch:
some of the blades in this Cisco switch
had hardware failures. As a result, our cluster lost the lan900 connection, and cmcld detected the failure on lan900.

It tried to switch to lan901, but failed.

Here is what I see in syslog:

Nov 17 07:36:17 f1e1pd05 cmcld[8598]: lan900 failed
Nov 17 07:36:17 f1e1pd05 cmcld[8598]: Subnet 10.10.32.0 switching from lan900 to lan901
Nov 17 07:36:17 f1e1pd05 cmcld[8598]: lan900 switching to lan901
Nov 17 07:36:17 f1e1pd05 cmcld[8598]: Failed to switch 10.10.32.58 from lan900(0,0) to lan901(3,a0a2000): Device busy
Nov 17 07:36:18 f1e1pd05 cmcld[8598]: Link level address on network interface lan900 has been updated from 0x0018714e325e to 0x0018714e325d.
Nov 17 07:36:18 f1e1pd05 cmcld[8598]: Sending file $SGRUN/frdump.cmcld.4 (512096 bytes) to file assistant daemon.
Nov 17 07:36:18 f1e1pd05 cmcld[8598]: Unable to set socket buffer size to 524288 bytes (No buffer space available), continuing anyway.
Nov 17 07:36:18 f1e1pd05 cmfileassistd[23323]: Updated file /var/adm/cmcluster/frdump.cmcld.4 (length = 512096).
Nov 17 07:36:19 f1e1pd05 cmcld[8598]: Failed to switch 10.10.32.58 from lan900(0,0) to lan901(3,a0a2000): Device busy
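
To pull only the failed switch attempts out of a large syslog (on HP-UX usually /var/adm/syslog/syslog.log), a small filter like the following may help. This is a sketch that assumes the cmcld message format shown in the excerpt above; the function name sg_failed_switches is hypothetical and not part of Serviceguard.

```sh
# Hypothetical helper (not a Serviceguard command): summarize cmcld
# "Failed to switch" messages. Assumes the message format shown above.
sg_failed_switches() {
    # Reads syslog text on stdin; prints "IP FROM TO REASON" per failure.
    awk '/cmcld\[[0-9]+\]: Failed to switch/ {
        ip = from = to = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "switch") ip = $(i + 1)
            if ($i == "from")   from = $(i + 1)
            # "to" also appears in "Failed to switch"; the last match wins.
            if ($i == "to")     to = $(i + 1)
        }
        # Drop the "(...)" hardware path suffix from interface names.
        sub(/\(.*/, "", from)
        sub(/\(.*/, "", to)
        # Text after the last ": " is the error reason (e.g. "Device busy").
        n = split($0, parts, ": ")
        print ip, from, to, parts[n]
    }'
}
```

For the excerpt above, both failure lines would be reported as 10.10.32.58 lan900 lan901 Device busy; typical usage would be sg_failed_switches < /var/adm/syslog/syslog.log.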

Has anybody seen this? It looks like the "Device busy" condition on lan901 prevented cmcld from failing over.
But this is a standby interface; what would make it busy?

Any ideas/suggestions are welcome. Thank you all.

Ra Jose.
freddy_21
Respected Contributor

Re: Serviceguard failed to switch over to local standby interface

Hi Ra Jose,

Please check connectivity between lan900 and lan901 on both servers.

My guess:
1. lan900 and lan901 are not in the same segment.
2. Please provide the output of netstat -in or netstat -nr.


Please check with these commands:
On server A:
linkloop -i 900 <MAC address of lan901>
linkloop -i 901 <MAC address of lan900>
linkloop -i 900 <MAC address of lan901 on server B>
linkloop -i 900 <MAC address of lan900 on server B>
linkloop -i 901 <MAC address of lan901 on server B>
linkloop -i 901 <MAC address of lan900 on server B>


On server B:
linkloop -i 900 <MAC address of lan901>
linkloop -i 901 <MAC address of lan900>
linkloop -i 900 <MAC address of lan901 on server A>
linkloop -i 900 <MAC address of lan900 on server A>
linkloop -i 901 <MAC address of lan901 on server A>
linkloop -i 901 <MAC address of lan900 on server A>

If any of these commands fails to connect, you must fix that first.
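
The six checks per server above follow a simple pattern: each local aggregate loops to the other local aggregate and to both aggregates on the other node. A minimal sketch that prints that matrix for one server is below; the function name linkloop_matrix is hypothetical, and the MAC addresses are placeholders you would fill in yourself (for example from lanscan output). It only echoes the commands, so they can be reviewed before being run.

```sh
# Hypothetical helper: print the linkloop checks for one server.
# Arguments (placeholders): MAC of local lan900, local lan901,
# remote lan900 and remote lan901, in that order.
linkloop_matrix() {
    local_900=$1 local_901=$2 remote_900=$3 remote_901=$4
    for ppa in 900 901; do
        # Loop to the other local aggregate first.
        if [ "$ppa" = 900 ]; then peer=$local_901; else peer=$local_900; fi
        echo "linkloop -i $ppa $peer"
        # Then to both aggregates on the other node.
        echo "linkloop -i $ppa $remote_901"
        echo "linkloop -i $ppa $remote_900"
    done
}
```

Running it once on each node, with the local and remote addresses swapped, gives the same six checks per server as listed above.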

Thanks
Freddy
RAC_1
Honored Contributor

Re: Serviceguard failed to switch over to local standby interface

APA and Serviceguard are a little difficult to handle together. Anyway, how many cards are there under lan900 and lan901? It tried moving from lan900 to lan901; does that mean all the cards under lan900 had problems?
There is no substitute to HARDWORK
Wim Rombauts
Honored Contributor

Re: Serviceguard failed to switch over to local standby interface

Maybe the error message itself points to the problem: Device busy.
Is any IP address configured on lan901 during normal operation? Maybe something else was already using lan901, which made it unavailable as a failover interface for Serviceguard.
Ra Jose
Regular Advisor

Re: Serviceguard failed to switch over to local standby interface

Thank you all for looking into this.

I also opened a case with the HP level-2 Serviceguard team. They found that the standby LAN (lan901 in
our case) was in UP status.

ifconfig lan901 showed UP status, and that is why SG saw it as busy and could not fail over. That appears to be the root cause. We are testing this on our QC/test cluster and I will post the results shortly.

APA and SG have worked well in our clusters.
We use APA on almost all servers. The prod
database servers have quad-port NICs. Each
card goes into one of the PCI slots, so it
takes one slot and provides 4 NIC ports.
We aggregate these 4 ports to form the
lan900 interface, on which we plumb the server IP.

Similarly, we aggregate the ports of another card in another slot into lan901, which becomes our standby interface for the SG cluster.

Ra Jose.
Stephen Doud
Honored Contributor

Re: Serviceguard failed to switch over to local standby interface

Check whether the standby LAN NIC shows this in its ifconfig output:
lan2: flags=842
inet 0.0.0.0 netmask 0

Note that the NIC is not configured "up", and it has no IP address.
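
To check this from a script rather than by eye, one option is to read the hex flags value from the first line of ifconfig output and test the UP bit. This is a sketch under two assumptions: that the flags value is printed in hex (consistent with 842 above meaning "not up"), and that UP is bit 0x1, the conventional BSD-style IFF_UP value. The function name iface_is_up is hypothetical.

```sh
# Hypothetical helper: exit status 0 if the ifconfig header line has the
# UP bit set, 1 if not, 2 if no "flags=" field was found.
# Assumes hex flags and IFF_UP == 0x1.
iface_is_up() {
    flags=$(printf '%s\n' "$1" | sed -n 's/.*flags=\([0-9A-Fa-f]*\).*/\1/p')
    [ -n "$flags" ] || return 2
    # 0x842 & 1 == 0 (not up); 0x843 & 1 == 1 (up).
    [ $(( 0x$flags & 1 )) -ne 0 ]
}
```

For example, iface_is_up "$(ifconfig lan901 | head -1)" && echo "lan901 is UP" would flag a standby interface left in the state that caused the "Device busy" failure in this thread.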