Operating System - HP-UX

gonzalo_7
New Member

MC/SG primary node is down (crashing)

Hi All,
I have a two-node cluster running Oracle (rel 7.3.2.3.0) and a custom program (these are my packages) on HP-UX B.10.20. Node A suddenly failed and halted all processes, and MC/SG started the package on node B, where it is now running properly. However, I have lost the primary server. I tried to switch back to node A manually, but I get the same fault and MC/SG fails over to node B again. Could somebody please help me with this problem? I have attached a file with excerpts from the log files.
Marvin Strong
Honored Contributor

Re: MC/SG primary node is down (crashing)

Well just glancing over your log real quick I would focus on this part:

--------------------------------
Error Timeout:AllStatusEnd file.
Error Timeout. SNMP Extensible Agent Statup Failure
----------------------------------

That seems to be coming from one of your scripts. It looks as though something is not configured the same way on the two nodes.

Any more clues in the syslog?

RAC_1
Honored Contributor

Re: MC/SG primary node is down (crashing)

You need to find out why node A is crashing.
Check /etc/shutdownlog
Check syslog.log

Did it generate a crash dump?

Anything in the /var/tombstones/ts99 file?

Anil
There is no substitute to HARDWORK
Sridhar Bhaskarla
Honored Contributor

Re: MC/SG primary node is down (crashing)

Hi,

recv error!! errno=242

If this is a system error, then it corresponds to "no route to host".

I would first see what changed on the primary node.

Look at your OLDsyslog.log around the time of the crash. Also compare the versions of the DCE products installed on both nodes.

It also says "/etc/cmcluster/toolkit/oracle/oracle.cntl[6]: 8753 Killed"

Something caused the control script to be killed. Verify your package configuration parameters and check whether NODE_FAIL_FAST_ENABLED and SERVICE_FAIL_FAST_ENABLED are set to YES. If so, then this behaviour is expected.

-Sri
You may be disappointed if you fail, but you are doomed if you don't try
Radhakrishnan Venkatara
Trusted Contributor

Re: MC/SG primary node is down (crashing)

Is node A part of your cluster now? Please send the cmviewcl -v output. Also try to get the cmgetconf output.

regards

Radhakrishnan
Negative thinking is a highest form of Intelligence
Ashwani Kashyap
Honored Contributor

Re: MC/SG primary node is down (crashing)

Looking at the logs, it seems that your control script times out creating the "AllStatusEnd" file and then times out starting the "SNMP Extensible Agent Statup".

This is picked up by the monitoring service ORACLE_RFT, and since it is a service failure, it shuts the package down. MC/SG then starts the package on the second node.

It looks like something is configured differently at the application level on the two nodes, which is why the package runs on one and not on the other.

There were some DCE errors at the beginning. Please also ensure that you have the same version and patch levels of DCE on both nodes and that DCE is running on both.
gonzalo_7
New Member

Re: MC/SG primary node is down (crashing)

Hi, and thanks to all for your responses.

Excuse me if the problem description is not very clear; I am new to HP-UX and MC/SG.

Unfortunately I didn't get a syslog from when the server crashed. However, I've restarted both servers several times with the same results. I've attached a zip file with syslogs and cluster configuration files.

The parameters NODE_FAIL_FAST_ENABLED and SERVICE_FAIL_FAST_ENABLED are both set to NO.

Node A is now a member of the cluster and is up and running, but it is not the current server. Please see the cmviewcl output at the end of the file I attached earlier.

I've run the commands #cmquerycl -v -n nodeA -n nodeB -C cfg_cluster.log
and #cmcheckconf -v -C cfg_cluster.log.
As far as I can tell, no errors were found (I included the log in the attached zip file).

The command cmgetconf didn't work.

Regards,

Gonzalo
Sridhar Bhaskarla
Honored Contributor

Re: MC/SG primary node is down (crashing)

Hi (Again),

//
Jun 11 17:26:29 rf05sbpe cmcld[7770]: Communication to node rf05sape has been interrupted
Jun 11 17:26:29 rf05sbpe cmcld[7770]: Attempting to form a new cluster
Jun 11 17:26:29 rf05sbpe cmcld[7770]: Communication with node rf05sape has been interrupted
//

The above is suspicious. Make sure the network interfaces are all up and running, particularly the heartbeat interfaces.
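
To see how often and when these interruptions occur, the cmcld messages can be pulled out of a syslog file. The following is only a rough sketch in Python, assuming the line layout matches the three lines quoted above; the names `PATTERN` and `interrupted_peers` are made up for illustration and are not part of any MC/SG tool.

```python
import re

# The three syslog lines quoted in this post, used as sample input.
SAMPLE = """\
Jun 11 17:26:29 rf05sbpe cmcld[7770]: Communication to node rf05sape has been interrupted
Jun 11 17:26:29 rf05sbpe cmcld[7770]: Attempting to form a new cluster
Jun 11 17:26:29 rf05sbpe cmcld[7770]: Communication with node rf05sape has been interrupted
"""

# Hypothetical pattern: timestamp, reporting host, then the peer node named by cmcld.
PATTERN = re.compile(
    r"^(\w{3}\s+\d+ [\d:]+) (\S+) cmcld\[\d+\]: "
    r"Communication (?:to|with) node (\S+) has been interrupted"
)


def interrupted_peers(text):
    """Return (timestamp, reporting_host, peer_node) for each interruption message."""
    hits = []
    for line in text.splitlines():
        match = PATTERN.match(line)
        if match:
            hits.append((match.group(1), match.group(2), match.group(3)))
    return hits


if __name__ == "__main__":
    for timestamp, host, peer in interrupted_peers(SAMPLE):
        print(timestamp, host, "lost contact with", peer)
```

Running the same function over the full syslog.log and OLDsyslog.log text would show whether the interruptions cluster around the times the package failed over.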

-Sri
You may be disappointed if you fail, but you are doomed if you don't try
gonzalo_7
New Member

Re: MC/SG primary node is down (crashing)

Hi All,

The LAN interfaces look OK, and the lanscan and ping commands work fine. Please see the log files inside the attached zip file.

I have also attached a zip file with our cluster configuration files, in case it helps you get an idea of our MC/SG environment.

regards.

Gonzalo.
Sridhar Bhaskarla
Honored Contributor

Re: MC/SG primary node is down (crashing)

Hi

HEARTBEAT_INTERVAL 1000000
NODE_TIMEOUT 2000000

Your node will time out if it misses two successive heartbeats. Your heartbeat interval is only 1 second, so the node timeout is only 2 seconds. That may be causing the issue.
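
The arithmetic behind that can be sketched as follows, assuming both settings are in microseconds (which matches the 1 second figure above). This is a small illustrative Python snippet; the helper names `to_seconds` and `heartbeats_within_timeout` are hypothetical and not part of MC/SG.

```python
# Values quoted from the posted cluster configuration (assumed to be microseconds).
HEARTBEAT_INTERVAL_US = 1_000_000
NODE_TIMEOUT_US = 2_000_000

US_PER_SECOND = 1_000_000


def to_seconds(microseconds):
    """Convert a microsecond setting to seconds."""
    return microseconds / US_PER_SECOND


def heartbeats_within_timeout(heartbeat_us, timeout_us):
    """Number of whole heartbeat intervals that fit inside the node timeout."""
    return timeout_us // heartbeat_us


if __name__ == "__main__":
    print("heartbeat interval (s):", to_seconds(HEARTBEAT_INTERVAL_US))
    print("node timeout (s):", to_seconds(NODE_TIMEOUT_US))
    print("heartbeats per timeout window:",
          heartbeats_within_timeout(HEARTBEAT_INTERVAL_US, NODE_TIMEOUT_US))
```

With the posted values only two heartbeat intervals fit in the timeout window, so a short stall on the network or the node is enough to trigger a cluster reformation.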

I know you will point out that the same configuration was working fine before. Yes, but something may have changed elsewhere on the system since then that is causing the interfaces to lock up temporarily. The DCE errors about network failures support this. I would look at parameters like buffer cache, memory utilization, etc., to make sure they are not causing intermittent freezes on the system.

If the system crashes again, then send the core dump to HP for more analysis.

-Sri
You may be disappointed if you fail, but you are doomed if you don't try
Mohanasundaram_1
Honored Contributor

Re: MC/SG primary node is down (crashing)

Hi ,

Sridhar is correct. The node timeout should be at least 4 times the heartbeat interval.

Apart from that, I strongly suspect something has changed in the service control on your node A. It would be useful if you posted the /etc/cmcluster//*.log from the failing node and also the syslog.log of the failing node.

Check the cmrunserv part on node A. Something has surely changed there.

Cheers,
Mohan.
Attitude, Not aptitude, determines your altitude
Kent Ostby
Honored Contributor

Re: MC/SG primary node is down (crashing)

On the HP MC/SG team, the recommendation is for NODE_TIMEOUT to be 8 to 10 seconds, e.g. 8000000 (the value is given in microseconds).

This does not overly delay the restart of the package, but it does avoid some failures caused by a brief LAN outage.

Best regards,

Kent M. Ostby
"Well, actually, she is a rocket scientist" -- Steve Martin in "Roxanne"