Ben Wood_2
Occasional Contributor

Alert 13

My system crashed and the errorlog showed an Alert 13:
"System hang detected via timer popping". I am new to the HP-UX environment and am looking for guidance on how to troubleshoot this. Does this type of error warrant making a service call to HP?

Thanks, Ben
9 REPLIES
Ken Hubnik_2
Honored Contributor

Re: Alert 13

One of my servers panicked last night and I received the same alert. I just got off the phone with HP support. Basically, that alert indicates your system panicked due to a hang condition and is rebooting. They suggested that in my case it was not hardware related, and that either the O/S or third-party software created the hang condition that caused the reboot. I am looking into my application logs for errors around the reboot time.
Jeff Schussele
Honored Contributor
Solution

Re: Alert 13

Hi Ben,

Depends - this can be caused by a CPU going bad & not releasing a resource that another CPU is waiting on.
OR it could be caused by bad code.

You need to look at several things:
1) The /etc/shutdownlog to see what actually initiated the shutdown
2) The GSP error log for any events that hint at HW failure.
3) The crash dump to see exactly what caused the "hang". You may need HP's expertise to decipher the dump.
4) The tombstone created by this event - it may spotlight a bad CPU.
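
A minimal sketch of checks 1) and 4), in case it helps to script them. It assumes Python 3 is available on the box (not a given on HP-UX) and that the files live in /etc/shutdownlog and /var/tombstones. The helper names are made up for illustration, and treating a file of only NUL bytes and whitespace as "blank" is my own assumption about what an empty tombstone looks like:

```python
from pathlib import Path

# Assumed locations; adjust if your system keeps them elsewhere.
SHUTDOWN_LOG = Path("/etc/shutdownlog")
TOMBSTONE_DIR = Path("/var/tombstones")


def last_entries(log_text, count=5):
    """Return the last `count` non-empty lines of a log."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    return lines[-count:]


def is_blank(data):
    """True if the bytes are empty or only NUL bytes and whitespace."""
    return not data.strip(b"\x00 \t\r\n")


def summarize(shutdown_log=SHUTDOWN_LOG, tombstone_dir=TOMBSTONE_DIR):
    """Collect the latest shutdown entries and flag blank tombstone files."""
    report = []
    if shutdown_log.is_file():
        report.extend(last_entries(shutdown_log.read_text(errors="replace")))
    else:
        report.append(f"{shutdown_log}: not found")
    if tombstone_dir.is_dir():
        for path in sorted(tombstone_dir.iterdir()):
            if path.is_file():
                state = "blank" if is_blank(path.read_bytes()) else "has data"
                report.append(f"{path}: {state}")
    else:
        report.append(f"{tombstone_dir}: not found")
    return report


if __name__ == "__main__":
    for line in summarize():
        print(line)
```

This only tells you whether there is anything to look at. It does not interpret the tombstone or the dump, and a blank tombstone does not rule out a HW problem.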

Either way I'd log a call to the Response Center & let them decide whether to open a HW or SW ticket. But you definitely need to investigate root cause & monitor this system closely.

Rgds,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Ken Hubnik_2
Honored Contributor

Re: Alert 13

Regarding Jeff's response: I logged a call with HP support and we went through all 4 of those steps, and everything was clean (meaning no errors). So if you check these and they are clean, then software could be the cause of the panic.
Jeff Schussele
Honored Contributor

Re: Alert 13

Again that depends.
First question should be - What's changed?
Any new code installed - be it OS or application?
If not & this code has been running w/o incident for some time, then I'd still lean towards a HW scenario.
Does the SW log anything? Have you checked those logs?

Ken - did HP have you ftp the entire dump to them, along with the tombstone, shutdownlog & swlist -l patch output?
If not I'd insist on it. A quick look at the dump & tombstone can easily miss the problem.
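
For gathering everything into one file before you ftp it, a rough sketch follows. It again assumes Python 3 (3.7 or later for `capture_output`) is on the system. The function names are hypothetical, and the location of your crash dump is something you have to supply yourself:

```python
import subprocess
import tarfile
from pathlib import Path


def swlist_patches():
    """Return the output of `swlist -l patch`, or None if swlist cannot be run."""
    try:
        result = subprocess.run(
            ["swlist", "-l", "patch"],
            capture_output=True, text=True, check=False,
        )
    except OSError:
        return None
    return result.stdout


def bundle(paths, archive):
    """Add every existing path to a gzipped tar; return the paths that were missing."""
    missing = []
    with tarfile.open(archive, "w:gz") as tar:
        for path in map(Path, paths):
            if path.exists():
                tar.add(str(path))  # directories are added recursively
            else:
                missing.append(str(path))
    return missing
```

Typical use would be to write the `swlist_patches()` output to a text file, then call `bundle()` with that file, /etc/shutdownlog, /var/tombstones and your dump directory. Anything it returns as missing is worth mentioning to HP when you send the archive.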

Rgds,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Ken Hubnik_2
Honored Contributor

Re: Alert 13

I sent them everything but the swlist. FYI, check out this link that was just posted this afternoon on the server board.

http://forums.itrc.hp.com/cm/QuestionAnswer/1,,0xe05d0fe6d0f7d61190050090279cd0f9,00.html
Jeff Schussele
Honored Contributor

Re: Alert 13

Yea - I saw that.

I had an N reboot like that several weeks ago that looked for all the world to be a SW hang. But we did spot "trace" residue of a HW problem in the GSP, though the msgs were VERY unspecific overall. The tombstone was blank - literally all zeros. But the dump held the clue: CPU 1 had hung, causing CPU 5 - which was waiting on a resource that CPU 1 was NEVER gonna release - to pop the timer. It took HP a day to find it in the dump.

Rgds,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Ken Hubnik_2
Honored Contributor

Re: Alert 13

Jeff, what did HP recommend as a fix? Did they take any action?
Jeff Schussele
Honored Contributor

Re: Alert 13

Yes - CPU 1 was replaced.
No trouble since.

Rgds,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Ben Wood_2
Occasional Contributor

Re: Alert 13

I checked the suggested logs and /var/tombstones. There was not any relevant info in either. I talked to HP and they are going to bring out a CE. Thanks for pointing me in the right direction.

--Ben