Shivkumar
Super Advisor

one node hung reason

Hi,

We have a 2-node Serviceguard cluster running Oracle 9i RAC. Our sysadmin team found one node in a hung state
and rebooted the server manually. I would have expected the hung server to take a TOC and reboot itself.

Can someone explain what could cause this kind of hang, and why the node did not take a crash dump and reboot?

Thanks,
Shiv
7 REPLIES
Torsten.
Acclaimed Contributor

Re: one node hung reason

Not every "hang" is the same.

You may feel the system is hung when the system itself is not: that is one scenario.

The system really is hung and will crash, even in a non-SG environment.

Your control script is not working as expected.

And some more...

You need to analyze the logs to get the reason.

Hope this helps!
Regards
Torsten.

__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.

__________________________________________________
No support by private messages. Please ask the forum!

Deoncia Grayson_1
Honored Contributor

Re: one node hung reason

The node could have panicked for any number of reasons. As Torsten stated, you have to investigate the logs to find out exactly why. Look at /etc/shutdownlog and your syslogs, and check /var/adm/crash to see whether a crash dump was created.
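
As a rough illustration of those checks, here is a minimal POSIX shell sketch. The function names are hypothetical, the paths are the ones mentioned above, and the assumption that a panic leaves a line containing the word "panic" in /etc/shutdownlog should be verified on your own system.

```shell
#!/usr/bin/sh
# Sketch only: look for evidence of a panic versus a manual reboot.
# Defaults are the HP-UX locations named in this thread; override to test.
SHUTDOWNLOG=${SHUTDOWNLOG:-/etc/shutdownlog}
CRASHDIR=${CRASHDIR:-/var/adm/crash}

# Print the last few shutdownlog entries, if the file is readable.
show_shutdownlog() {
    if [ -r "$SHUTDOWNLOG" ]; then
        tail -n 5 "$SHUTDOWNLOG"
    else
        echo "cannot read $SHUTDOWNLOG"
    fi
}

# Print the most recent entry mentioning "panic" (assumed wording), if any.
last_panic_entry() {
    grep -i panic "$SHUTDOWNLOG" 2>/dev/null | tail -n 1
}

# Print "yes" if the crash directory contains anything, otherwise "no".
crash_dumps_present() {
    if [ -d "$CRASHDIR" ] && [ -n "$(ls -A "$CRASHDIR" 2>/dev/null)" ]; then
        echo yes
    else
        echo no
    fi
}

show_shutdownlog
echo "last panic entry: $(last_panic_entry)"
echo "crash dumps present: $(crash_dumps_present)"
```

An empty crash directory together with a shutdownlog that only shows an operator reboot would point toward a manual reboot rather than a panic.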



Deonia
If no one ever took risks, Michelangelo would have painted the Sistine floor. -Neil Simon
Bill Hassell
Honored Contributor
Solution

Re: one node hung reason

There is no reason to assume that a hang condition will cause a TOC/panic. You have to determine the state of the system. When you say 'hung', is that being measured over a network connection? The system may be running fine while a network router has failed, so the link between your device (such as a PC) and the server is broken. When the system appears to hang, you must go to the console so you can bypass the networking. If you can log in there, you'll have to determine whether local networking is still working.

If the console login fails, there may be a hardware failure and your cluster failover is not set up correctly.
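
The decision steps above can be sketched as a small shell function. `triage` is a hypothetical helper, its three yes/no arguments are observations you make by hand, and the last branch (local networking down) is an inference of this sketch, not something stated above.

```shell
# triage REMOTE_LOGIN CONSOLE_LOGIN LOCAL_NET
# Each argument is "yes" or "no", describing what was observed.
triage() {
    remote_login=$1
    console_login=$2
    local_net=$3

    if [ "$remote_login" = yes ]; then
        echo "remote login works: the system is not hung"
    elif [ "$console_login" != yes ]; then
        echo "console login fails: possible hardware failure; review cluster failover setup"
    elif [ "$local_net" = yes ]; then
        echo "console works and local networking is up: suspect a router or link outside the node"
    else
        # Assumption of this sketch, not from the post above.
        echo "console works but local networking is down: suspect the node's own network setup"
    fi
}

# Example: no remote login, console login succeeds, local networking is up.
triage no yes yes
```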


Bill Hassell, sysadmin
inventsekar_1
Respected Contributor

Re: one node hung reason

I have a question, Shiv:
how did your sysadmin team find that one node was in a hung state? Is there a command for that?
I am curious to know.

And what is meant by a "crash" in Unix?
Sometimes my Windows system crashes, and when we reinstall Windows we lose only the data on the C drive. What happens in a Unix crash? When we reinstall the Unix OS, what data can we get back or recover?
Be Tomorrow, Today.
Shivkumar
Super Advisor

Re: one node hung reason

Sekar,
Someone told us, incorrectly, that it was in a hung state. In fact the system panicked and rebooted itself, so this seems to be expected behaviour.

Thanks,
Shiv
Bill Hassell
Honored Contributor

Re: one node hung reason

Actually, a system panic is not a 'normal' state. Your systems need up-to-date patches. It is not unusual for HP-UX systems to run for years without a single panic and reboot, but these systems should be fully patched every 4 to 6 months.


Bill Hassell, sysadmin
Michael Steele_2
Honored Contributor

Re: one node hung reason

A panic and reboot is not the same as a 'hung' system. See Mr. Hassell's description.

You should investigate your log files:

/var/adm/tombstones/ts98, ts99 (* HPMC's?*)
OLDsyslog.log (* Read from bottom up *)

SERVICE GUARD
cmreadlog /var/opt/cmon/cmomd.g
cmreadlog /var/opt/sgmgr/929917sgmgr.log
cmscancl -n node -o outputfile

SAVECRASH
/var/adm/crash/*

GSP > sl > e
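
For the "read from bottom up" step, a minimal sketch that prints a log last line first using only POSIX awk. The `bottom_up` helper is hypothetical, and the default path /var/adm/syslog/OLDsyslog.log is an assumption about where the file lives on your system.

```shell
# Print a file in reverse line order, so the messages logged just
# before the reboot appear first.
bottom_up() {
    awk '{ line[NR] = $0 } END { for (i = NR; i > 0; i--) print line[i] }' "$1"
}

# Assumed default location; override LOG to point elsewhere.
LOG=${LOG:-/var/adm/syslog/OLDsyslog.log}
if [ -r "$LOG" ]; then
    bottom_up "$LOG" | head -n 40
fi
```

awk holds the whole file in memory here, which is fine for a syslog of ordinary size.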
Support Fatherhood - Stop Family Law