Re: MC/SG primary node is down (crashing)

gonzalo_7 · ‎06-16-2004

Hi All,
I have a cluster with two nodes running Oracle (rel 7.3.2.3.0) and a custom program (these are my packages) on HP-UX B.10.20. Node A suddenly failed and halted all processes, and MC/SG then started the package on node B, where it is now running properly. However, I have lost the primary server. I tried to switch back to node A manually, but I get the same fault and MC/SG switches to node B again. Could somebody help me with this problem, please? I have attached a file with parts of the log files.

Marvin Strong · ‎06-16-2004

Well, just glancing over your log real quick, I would focus on this part:

--------------------------------
Error Timeout:AllStatusEnd file.
Error Timeout. SNMP Extensible Agent Statup Failure
----------------------------------

That seems to be coming from one of your scripts. It looks as though something is not configured the same way on the two nodes.

Any more clues in the syslog?

RAC_1 · ‎06-16-2004

You need to find out why node A is crashing.
Check /etc/shutdownlog
Check syslog.log

Did it generate a crash dump?

Anything in /var/tombstones/ts99 file?

Anil

There is no substitute to HARDWORK

Sridhar Bhaskarla · ‎06-16-2004

Hi,

recv error!! errno=242

If this is a system error, then it corresponds to "no route to host".

I would first see what changed on the primary node.

Look at your OLDsyslog.log around the time of the crash. Also compare the versions of the DCE products installed on both nodes.

It also says "/etc/cmcluster/toolkit/oracle/oracle.cntl[6]: 8753 Killed"

Something caused the control script to be killed. Verify your package configuration parameters and see whether
NODE_FAIL_FAST_ENABLED and SERVICE_FAIL_FAST_ENABLED are set to YES. If so, then this behaviour is expected.

-Sri

You may be disappointed if you fail, but you are doomed if you don't try

Radhakrishnan Venkatara · ‎06-16-2004

Is node A part of your cluster now? Please send the cmviewcl -v output. Also try to get the cmgetconf output.

regards

Radhakrishnan

Negative thinking is a highest form of Intelligence

Ashwani Kashyap · ‎06-16-2004

Looking at the logs, it seems that your control script times out creating the "AllStatusEnd" file and then times out starting the SNMP Extensible Agent.

This is picked up by the monitoring service ORACLE_RFT, and since it is a failure of a service, it shuts the package down; MC/SG then starts it on the second node.

It looks like something is configured differently, application-wise, on the two nodes, which is why the package runs on one and not on the other.

There were some DCE errors at the beginning. Please also ensure that you have the same version and patch level of DCE on both nodes and that it is running on both.

gonzalo_7 · ‎06-16-2004

Hi, and thanks to all for your responses.

Excuse me if the problem description is not very clear; I am new to HP-UX and MC/SG.

Unfortunately I didn't get a syslog from when the server crashed. However, I've restarted both servers several times with the same results, and I've attached a zip file with the syslogs and cluster cfg files.

The parameters NODE_FAIL_FAST_ENABLED and SERVICE_FAIL_FAST_ENABLED are both set to NO.

Node A is now a member of the cluster and is up and running, but it is not the current server; please see the cmviewcl output at the end of the file I attached earlier.

I ran the commands
# cmquerycl -v -n nodeA -n nodeB -C cfg_cluster.log
and
# cmcheckconf -v -C cfg_cluster.log
It seems to me that no errors were found (I included the log in the attached zip file).

The command cmgetconf didn't work.

Regards,

Gonzalo

Sridhar Bhaskarla · ‎06-16-2004

Hi (Again),

//
Jun 11 17:26:29 rf05sbpe cmcld[7770]: Communication to node rf05sape has been interrupted
Jun 11 17:26:29 rf05sbpe cmcld[7770]: Attempting to form a new cluster
Jun 11 17:26:29 rf05sbpe cmcld[7770]: Communication with node rf05sape has been interrupted
//

The above is suspicious. Make sure the network interfaces are all up and running, particularly the heartbeat interfaces.
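To see how often this happens, the cmcld messages can be pulled out of a syslog file. The following is only a rough Python sketch: the pattern is built from the three lines quoted above, not from any Serviceguard documentation, and the function name heartbeat_events is made up for illustration.

```python
import re

# Matches syslog lines written by cmcld. The layout is inferred from the
# excerpt quoted above (an assumption, not a documented format).
CMCLD_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"cmcld\[\d+\]:\s+(?P<msg>.*)$"
)


def heartbeat_events(lines):
    """Return (timestamp, host, message) for cmcld lines that report an
    interrupted connection or an attempt to re-form the cluster."""
    events = []
    for line in lines:
        match = CMCLD_LINE.match(line.strip())
        if not match:
            continue
        msg = match.group("msg")
        if "has been interrupted" in msg or "form a new cluster" in msg:
            events.append((match.group("stamp"), match.group("host"), msg))
    return events


# The three lines from the excerpt, used here as sample input.
sample = [
    "Jun 11 17:26:29 rf05sbpe cmcld[7770]: Communication to node rf05sape has been interrupted",
    "Jun 11 17:26:29 rf05sbpe cmcld[7770]: Attempting to form a new cluster",
    "Jun 11 17:26:29 rf05sbpe cmcld[7770]: Communication with node rf05sape has been interrupted",
]

for stamp, host, msg in heartbeat_events(sample):
    print(stamp, host, msg)
```

Running the same function over the lines of syslog.log and OLDsyslog.log should show whether the interruptions cluster around the times node A went down.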

-Sri

You may be disappointed if you fail, but you are doomed if you don't try

gonzalo_7 · ‎06-16-2004

Hi All,

The LAN interfaces look OK, and the lanscan and ping commands work fine; please see the log files inside the attached zip file.

I have also attached a zip file with our cluster configuration files, in case it helps you get an idea of our MC/SG environment.

Regards,

Gonzalo.

Sridhar Bhaskarla · ‎06-16-2004

Hi

HEARTBEAT_INTERVAL 1000000
NODE_TIMEOUT 2000000

Your node will time out if it misses two successive heartbeats. Your heartbeat interval is only 1 second, so the node timeout is only 2 seconds. That may be causing the issue.
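The arithmetic behind that, as a small Python sketch. It assumes the values are in microseconds (which is how 1000000 comes out as 1 second) and uses the simplified model described here, where a node is declared failed once no heartbeat has arrived for NODE_TIMEOUT. The function names are made up for illustration.

```python
# Values quoted above from the cluster configuration, in microseconds.
HEARTBEAT_INTERVAL_US = 1_000_000
NODE_TIMEOUT_US = 2_000_000


def to_seconds(microseconds):
    """Convert a configuration value from microseconds to seconds."""
    return microseconds / 1_000_000


def heartbeats_per_timeout(heartbeat_interval_us, node_timeout_us):
    """How many whole heartbeat intervals fit inside the node timeout."""
    return node_timeout_us // heartbeat_interval_us


missed = heartbeats_per_timeout(HEARTBEAT_INTERVAL_US, NODE_TIMEOUT_US)
print("heartbeat every", to_seconds(HEARTBEAT_INTERVAL_US), "s")
print("node timeout after", to_seconds(NODE_TIMEOUT_US), "s")
print("missed heartbeats before timeout:", missed)
```

With these settings any stall of about 2 seconds on the heartbeat path is enough for the other node to start forming a new cluster.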

I know you will ask why the same configuration was working fine before. Yes, it was. But something may have changed elsewhere on the system since then that causes the interfaces to lock up temporarily. The DCE errors about network failures support that. I would check things like buffer cache and memory utilization to make sure they are not causing intermittent freezes on the system.

If the system crashes again, then send the core dump to HP for more analysis.

-Sri

You may be disappointed if you fail, but you are doomed if you don't try

Mohanasundaram_1 · ‎06-19-2004

Hi ,

Sridhar is correct. The node timeout should be at least 4 times the heartbeat interval.

Apart from that, I strongly suspect something has changed in the service control on your node A. It would be useful if you posted the /etc/cmcluster//*.log files and the syslog.log from the failing node.

Check the cmrunserv part on node A. Something has surely changed there.

Cheers,
Mohan.

Attitude, Not aptitude, determines your altitude

Kent Ostby · ‎06-23-2004

On the HP MC/SG team, the recommendation is for NODE_TIMEOUT to be 8 to 10 seconds, i.e. 8000000 or more.

This does not overly delay the restart of the package, but it does avoid some failures caused by a brief LAN outage.
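A quick way to check a configuration against that figure is sketched below in Python. It assumes the cluster ASCII file uses "NAME value" lines with # for comments, as in the lines quoted earlier in this thread; the function names and the 8-second threshold constant are only illustrative.

```python
# 8 seconds in microseconds, the lower end of the range suggested above.
RECOMMENDED_MIN_NODE_TIMEOUT_US = 8_000_000


def read_timing_parameters(config_text):
    """Pick HEARTBEAT_INTERVAL and NODE_TIMEOUT out of 'NAME value' lines."""
    wanted = {"HEARTBEAT_INTERVAL", "NODE_TIMEOUT"}
    values = {}
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (assumed '#')
        parts = line.split()
        if len(parts) == 2 and parts[0] in wanted and parts[1].isdigit():
            values[parts[0]] = int(parts[1])
    return values


def node_timeout_warning(values):
    """Return a warning string if NODE_TIMEOUT is missing or too low."""
    timeout = values.get("NODE_TIMEOUT")
    if timeout is None:
        return "NODE_TIMEOUT not found"
    if timeout < RECOMMENDED_MIN_NODE_TIMEOUT_US:
        return f"NODE_TIMEOUT is {timeout / 1e6:g} s, below the suggested 8 s"
    return None


# The two values posted earlier in the thread.
current = read_timing_parameters("HEARTBEAT_INTERVAL 1000000\nNODE_TIMEOUT 2000000\n")
print(node_timeout_warning(current))
```

The actual change still has to be made in the cluster configuration file and verified with cmcheckconf before it is applied.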

Best regards,

Kent M. Ostby

"Well, actually, she is a rocket scientist" -- Steve Martin in "Roxanne"
