
Node eviction due to lost heartbeat interconnect

BHEL
Advisor

Node eviction due to lost heartbeat interconnect

Hi,
We have installed an Oracle 11g RAC database on HP-UX 11i v3. Last week we had a system crash. We analysed the logs and found that Oracle CRS initiated the crash. On further analysis of the Oracle Clusterware logs, we found that the node eviction was caused by loss of the cluster interconnect, i.e. a "heartbeat fatal" eviction. The action suggested by Oracle is to check the availability of the heartbeat networks and to check the OS log files for reported errors related to the interconnect.
However, we did not find any such errors on the OS side (HP-UX 11i v3). Kindly tell us which logs should be checked for heartbeat link failures on the OS side.

13 REPLIES
Kapil Jha
Honored Contributor

Re: Node eviction due to lost heartbeat interconnect

What do syslog.log and OLDsyslog.log in the /var/adm/syslog directory
say around the time of the crash?

BR,
Kapil+
I am in this small bowl, I wane see the real world......
BHEL
Advisor

Re: Node eviction due to lost heartbeat interconnect

OLDsyslog.log has the message below:
Aug 21 08:03:00 bap02 ntpdate[8599]: the NTP socket is in use, exiting.

syslog.log contains only messages captured after the reboot.
BHEL
Advisor

Re: Node eviction due to lost heartbeat interconnect

Where else can we find network-related errors in HP-UX 11i v3?
ManojK_1
Valued Contributor

Re: Node eviction due to lost heartbeat interconnect

Hi,

Please paste the output of

netfmt -f /var/adm/nettl.LOG000

This will give the details of link-down and link-up events.


Manoj K
Thanks and Regards,
Manoj K
BHEL
Advisor

Re: Node eviction due to lost heartbeat interconnect

Hi,
Here is the log....
***********************************STREAMS/UX*******************************@#%
Timestamp : Sat Aug 21 IST 2010 09:40:09.276563
Process ID : 4822 Subsystem : STREAMS
User ID ( UID ) : 500 Log Class : ERROR
Device ID : 0 Path ID : 0
Connection ID : 0 Log Instance : 0
Location : 00123
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 09:40:09 165915 1 T.. 2224 24 tl_wput:T_OPTMGMT_REQ:out of state, state=10

***********************************STREAMS/UX*******************************@#%
Timestamp : Sat Aug 21 IST 2010 09:40:09.277966
Process ID : 4822 Subsystem : STREAMS
User ID ( UID ) : 500 Log Class : ERROR
Device ID : 0 Path ID : 0
Connection ID : 0 Log Instance : 0
Location : 00123
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 09:40:09 165915 1 T.. 2224 24 tl_wput:T_OPTMGMT_REQ:out of state, state=10

***********************************STREAMS/UX*******************************@#%
Timestamp : Sat Aug 21 IST 2010 09:40:09.279800
Process ID : 4822 Subsystem : STREAMS
User ID ( UID ) : 500 Log Class : ERROR
Device ID : 0 Path ID : 0
Connection ID : 0 Log Instance : 0
Location : 00123
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3 09:40:09 165915 1 T.. 2224 24 tl_wput:T_OPTMGMT_REQ:out of state, state=10

***********************************STREAMS/UX*******************************@#%
Timestamp : Sat Aug 21 IST 2010 09:40:09.280474
Process ID : 4822 Subsystem : STREAMS
User ID ( UID ) : 500 Log Class : ERROR
Device ID : 0 Path ID : 0
Connection ID : 0 Log Instance : 0
Location : 00123
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4 09:40:09 165915 1 T.. 2224 24 tl_wput:T_OPTMGMT_REQ:out of state, state=10

ManojK_1
Valued Contributor

Re: Node eviction due to lost heartbeat interconnect

Hi,

1) In the clusterware log, at what time does it show the interconnect as lost?
2) Is Serviceguard configured on the servers?
3) Which user has user ID (UID) 500?
4) Can you attach the clusterware log?
5) Can you paste the netstat -in output?
6) Which LAN is used for public traffic and which for private?

Manoj K
Thanks and Regards,
Manoj K
BHEL
Advisor

Re: Node eviction due to lost heartbeat interconnect

Hi,
1) In the clusterware log, at what time does it show the interconnect as lost?
2010-08-21 08:23:06

2) Is Serviceguard configured on the servers?
No

3) Which user has user ID (UID) 500?
Oracle


4) Attach the clusterware log? Attached:

2010-08-21 08:23:06.294
[cssd(3474)]CRS-1612:node bap01 (0) at 50% heartbeat fatal, eviction in 0.000 seconds
2010-08-21 08:23:07.294
[cssd(3474)]CRS-1612:node bap01 (0) at 50% heartbeat fatal, eviction in 0.000 seconds
2010-08-21 08:23:14.294
[cssd(3474)]CRS-1611:node bap01 (0) at 75% heartbeat fatal, eviction in 0.000 seconds
2010-08-21 08:23:18.294
[cssd(3474)]CRS-1610:node bap01 (0) at 90% heartbeat fatal, eviction in 0.000 seconds
2010-08-21 08:23:19.294
[cssd(3474)]CRS-1610:node bap01 (0) at 90% heartbeat fatal, eviction in 0.000 seconds
2010-08-21 08:23:20.294
[cssd(3474)]CRS-1610:node bap01 (0) at 90% heartbeat fatal, eviction in 0.000 seconds
2010-08-21 09:24:25.928
[cssd(3518)]CRS-1605:CSSD voting file is online: /dev/oracle/asmvot1. Details in /home/oracle/product/CRS/log/bap02/cssd/ocssd.log.
[cssd(3518)]CRS-1601:CSSD Reconfiguration complete. Active nodes are bap01 bap02 .
2010-08-21 09:24:26.862
[evmd(3298)]CRS-1401:EVMD started on node bap02.
2010-08-21 09:24:26.956
[crsd(3309)]CRS-1012:The OCR service started on node bap02.
2010-08-21 09:24:29.419
[crsd(3309)]CRS-1201:CRSD started on node bap02.
201


5) Paste the netstat -in output

Name        Mtu    Network    Address     Ipkts      Ierrs  Opkts       Oerrs  Coll
lo0         32808  127.0.0.0  127.0.0.1   113371872  0      113372555   0      0
lan901      1500   10.7.3.0   10.7.3.201  164089015  0      122297967   0      0
lan900      1500   10.7.1.0   10.7.1.201  580696289  0      1176359227  0      0
lan900:801  1500   10.7.1.0   10.7.1.206  256740362  0      22666       0      0


6) Which LAN is used for public traffic and which for private?
All are private only.
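
For anyone reading the clusterware log excerpt above, here is a rough Python sketch (not an Oracle tool) that pulls the CRS-1610/1611/1612 "heartbeat fatal" warnings out of the text and shows how they escalate. The sample lines are copied from the log posted in this reply; the function name `heartbeat_warnings` and the two regular expressions are my own and assume the timestamp is on its own line directly above each message, as it is in this excerpt.

```python
import re
from datetime import datetime

# Lines copied from the clusterware log excerpt posted in this thread.
LOG = """\
2010-08-21 08:23:06.294
[cssd(3474)]CRS-1612:node bap01 (0) at 50% heartbeat fatal, eviction in 0.000 seconds
2010-08-21 08:23:07.294
[cssd(3474)]CRS-1612:node bap01 (0) at 50% heartbeat fatal, eviction in 0.000 seconds
2010-08-21 08:23:14.294
[cssd(3474)]CRS-1611:node bap01 (0) at 75% heartbeat fatal, eviction in 0.000 seconds
2010-08-21 08:23:18.294
[cssd(3474)]CRS-1610:node bap01 (0) at 90% heartbeat fatal, eviction in 0.000 seconds
2010-08-21 08:23:19.294
[cssd(3474)]CRS-1610:node bap01 (0) at 90% heartbeat fatal, eviction in 0.000 seconds
2010-08-21 08:23:20.294
[cssd(3474)]CRS-1610:node bap01 (0) at 90% heartbeat fatal, eviction in 0.000 seconds
2010-08-21 09:24:25.928
[cssd(3518)]CRS-1605:CSSD voting file is online: /dev/oracle/asmvot1. Details in /home/oracle/product/CRS/log/bap02/cssd/ocssd.log.
"""

# A timestamp line on its own, e.g. "2010-08-21 08:23:06.294".
TS = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}$")
# A heartbeat warning from cssd: message code, node name, percentage.
MSG = re.compile(r"\[cssd\(\d+\)\](CRS-161[0-2]):node (\S+) \(\d+\) at (\d+)% heartbeat fatal")


def heartbeat_warnings(text):
    """Return (timestamp, code, node, percent) for each heartbeat warning."""
    current = None
    found = []
    for line in text.splitlines():
        line = line.strip()
        if TS.match(line):
            current = datetime.strptime(line, "%Y-%m-%d %H:%M:%S.%f")
            continue
        match = MSG.search(line)
        if match and current is not None:
            found.append((current, match.group(1), match.group(2), int(match.group(3))))
    return found


if __name__ == "__main__":
    warnings = heartbeat_warnings(LOG)
    for when, code, node, percent in warnings:
        print(when.time(), code, node, f"{percent}%")
    print("escalation window:", warnings[-1][0] - warnings[0][0])
```

On this excerpt the warnings all name bap01 and run from 08:23:06 to 08:23:20, and the log path in the CRS-1605 line suggests the excerpt was written on bap02. That window is the one to line up against any OS or switch logs.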

Thanks in Advance..
ManojK_1
Valued Contributor

Re: Node eviction due to lost heartbeat interconnect



>>>6)Which lan is using for public and private
>>>All are private only.
This is not clear.

Run the command "oifcfg getif" and provide the output.

Was there any time difference between the RAC nodes?

Manoj K
Thanks and Regards,
Manoj K
BHEL
Advisor

Re: Node eviction due to lost heartbeat interconnect

Hi

Sorry, please ignore that.
Here is the output of oifcfg getif:

lan901 10.7.3.0 global cluster_interconnect
lan900 10.7.1.0 global public
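
To tie this output to the netstat -in listing posted earlier in the thread, here is a small Python sketch that looks up which interface is registered as cluster_interconnect and reports its error counters. The two sample strings are copied from this thread; the function names are made up for illustration, and the parsing assumes the whitespace-separated layouts shown here.

```python
# Output of "oifcfg getif" as posted in this thread.
OIFCFG = """\
lan901 10.7.3.0 global cluster_interconnect
lan900 10.7.1.0 global public
"""

# Output of "netstat -in" as posted in this thread.
NETSTAT = """\
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
lo0 32808 127.0.0.0 127.0.0.1 113371872 0 113372555 0 0
lan901 1500 10.7.3.0 10.7.3.201 164089015 0 122297967 0 0
lan900 1500 10.7.1.0 10.7.1.201 580696289 0 1176359227 0 0
lan900:801 1500 10.7.1.0 10.7.1.206 256740362 0 22666 0 0
"""


def interface_roles(oifcfg_text):
    """Map interface name -> role from 'name subnet scope role' lines."""
    roles = {}
    for line in oifcfg_text.splitlines():
        parts = line.split()
        if len(parts) == 4:
            name, _subnet, _scope, role = parts
            roles[name] = role
    return roles


def error_counters(netstat_text):
    """Map interface name -> Ierrs/Oerrs/Coll counters from netstat -in."""
    lines = [line for line in netstat_text.splitlines() if line.strip()]
    header = lines[0].split()
    counters = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        counters[row["Name"]] = {key: int(row[key]) for key in ("Ierrs", "Oerrs", "Coll")}
    return counters


def interconnect_errors(oifcfg_text, netstat_text):
    """Return the error counters of every cluster_interconnect interface."""
    roles = interface_roles(oifcfg_text)
    counters = error_counters(netstat_text)
    return {
        name: counters[name]
        for name, role in roles.items()
        if role == "cluster_interconnect" and name in counters
    }


if __name__ == "__main__":
    print(interconnect_errors(OIFCFG, NETSTAT))
```

With the data from this thread, lan901 is the interconnect and its Ierrs, Oerrs and Coll counters are all zero. Keep in mind that these counters are cumulative, so if the listing was taken after the node rebooted they would not show errors from before the crash.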

Thanks in advance
BHEL
Advisor

Re: Node eviction due to lost heartbeat interconnect

Hi
There is no time difference between the RAC nodes
ManojK_1
Valued Contributor

Re: Node eviction due to lost heartbeat interconnect


Did the server reboot after the database crashed?

At what time did the database come back up on the crashed server?

Can you please compare the clusterware logs from both servers and verify that the crash is logged at the same time on both nodes?

There is no indication of a network failure at the OS level.

If it is not a time-sync issue, the crash might have happened because the interconnect between the nodes became saturated and communication stalled.
Please check whether any logs are available from the switch side.
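
The timestamp comparison can be sketched in Python as below. The timestamps are taken from the nettl and clusterware output posted earlier in this thread; the helper names, the 10-minute margin, and the assumption that both logs use the same local time zone are mine.

```python
from datetime import datetime, timedelta

MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}

# "Timestamp :" values from the netfmt output posted in this thread.
NETTL_TIMESTAMPS = [
    "Sat Aug 21 IST 2010 09:40:09.276563",
    "Sat Aug 21 IST 2010 09:40:09.277966",
    "Sat Aug 21 IST 2010 09:40:09.279800",
    "Sat Aug 21 IST 2010 09:40:09.280474",
]

# First and last heartbeat warnings from the clusterware log in this thread.
FIRST_WARNING = datetime(2010, 8, 21, 8, 23, 6, 294000)
LAST_WARNING = datetime(2010, 8, 21, 8, 23, 20, 294000)


def parse_nettl_timestamp(value):
    """Parse 'Sat Aug 21 IST 2010 09:40:09.276563', ignoring the zone token."""
    _weekday, month, day, _zone, year, clock = value.split()
    t = datetime.strptime(clock, "%H:%M:%S.%f")
    return datetime(int(year), MONTHS[month], int(day),
                    t.hour, t.minute, t.second, t.microsecond)


def events_near(events, start, end, margin):
    """Return the events that fall within [start - margin, end + margin]."""
    return [e for e in events if start - margin <= e <= end + margin]


if __name__ == "__main__":
    events = [parse_nettl_timestamp(v) for v in NETTL_TIMESTAMPS]
    near = events_near(events, FIRST_WARNING, LAST_WARNING, timedelta(minutes=10))
    print("nettl entries near the eviction:", len(near))
    print("gap to first nettl entry:", min(events) - LAST_WARNING)
```

With the data posted so far, no nettl entry falls near the 08:23 warnings. The STREAMS errors are logged at 09:40, more than an hour later and after CSSD had already restarted at 09:24, so they do not look related to the eviction itself.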

Manoj K
Thanks and Regards,
Manoj K
BHEL
Advisor

Re: Node eviction due to lost heartbeat interconnect

Hi
We checked with Cisco, and they reported no network disconnect at that time.
Shall I send the crash dump file generated at that time?
BHEL
Advisor

Re: Node eviction due to lost heartbeat interconnect

Hi
I am attaching the ts99 file herewith.