Server crash

 
Narayanan_2
Advisor

My DS20 (dual processor, 4.0F) server crashed twice today. The first time there was just a blank screen, even when halt was pressed. After a cold boot it ran OK, but only for about 7 hours. Then it crashed again, this time with a blue screen showing some information. The error shown on the console is:

Halted CPU 1
CPU 0 is not halted
halt code = 7
machine check while in PAL mode
PC = 1D0C0
warning too many processor corrected errors detected on cpu 0. Reporting suspended.

After a cold boot the server managed to boot, but I am worried it might crash again. Please advise.

16 REPLIES
Mohamed  K Ahmed
Trusted Contributor
Solution

Re: Server crash

As the message says, there are too many errors on CPU 0.
Check the /var/adm/messages file for more information about the error.
Most of the time when this happens, you will need to replace the CPU.
Call HP support and let them know that you have a problem with the CPU crashing. Hope you have a service agreement with them.
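
For a quick look through /var/adm/messages, a small script can count how often the warning shows up for each CPU. This is only a sketch: the function name `count_cpu_warnings` is made up for illustration, and the pattern assumes the log lines carry the same wording as the console message quoted above, which may not hold on your system.

```python
import re
from collections import Counter

# Assumption: the log line matches the console wording, e.g.
# "... too many processor corrected errors detected on cpu 0. ..."
PATTERN = re.compile(r"corrected errors detected on cpu\s*(\d+)", re.IGNORECASE)


def count_cpu_warnings(lines):
    """Return a Counter mapping CPU number to the number of matching lines."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

To use it, open the log and pass the file object in, for example `count_cpu_warnings(open("/var/adm/messages", errors="replace"))`. A count that keeps growing for one CPU is something to mention when you log the call.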

If you do not have a service agreement, you can troubleshoot the problem yourself: swap the two CPUs (since the errors are reported on CPU 0), then disable the faulty CPU (which is now CPU 1) using the SRM console commands.

P000> show cpu_enabled
Most probably it will be ff (all CPUs enabled).
Change it to 1 to enable only CPU 0 and disable the others:
P000> set cpu_enabled 1
and boot the system and monitor it.
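
In case the ff and 1 values look arbitrary: cpu_enabled is a hexadecimal bit mask, where bit n enables CPU n. The sketch below only illustrates that arithmetic; the helper names `enabled_cpus` and `mask_for` are invented for this example and are not part of SRM.

```python
def enabled_cpus(mask_hex):
    """Decode a cpu_enabled hex mask into the list of enabled CPU numbers."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def mask_for(cpus):
    """Build the hex mask that enables exactly the given CPU numbers."""
    return format(sum(1 << cpu for cpu in set(cpus)), "x")


print(enabled_cpus("ff"))  # bits 0 through 7 are set
print(enabled_cpus("1"))   # only bit 0 is set
print(mask_for([0, 1]))    # both CPUs of a dual-processor box
```

So on a dual-processor DS20, a value of 1 leaves CPU 0 running alone, and 3 would turn both CPUs back on after the test.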

HTH
Mohamed
Michael Schulte zur Sur
Honored Contributor

Re: Server crash

Hi,

You could use decevent to look into the binary error log and see what the CPU is complaining about. If you do, can you post the output?
Otherwise I concur with Mohamed: replace it.

Michael
Ralf Puchner
Honored Contributor

Re: Server crash

A machine check could be caused by:
memory, CPU, cache

So please open a call with an HP support center and provide the binary.errlog for further investigation. This is a hardware issue.
Help() { FirstReadManual(urgently); Go_to_it;; }
Narayanan_2
Advisor

Re: Server crash

Thanks all. I have contacted HP and replaced the CPU. The binary error log did not contain enough information to pinpoint a CPU or memory problem. We did not want to take any risk, so we replaced the CPU.
Ralf Puchner
Honored Contributor

Re: Server crash

It seems odd to me that binary.errlog doesn't contain enough information in the case of a machine check.

Are you sure the binary.errlog was properly analyzed?
Help() { FirstReadManual(urgently); Go_to_it;; }
Narayanan_2
Advisor

Re: Server crash

Ralf,
The binary.errlog was analyzed by an HP engineer and he couldn't find anything in it. After the crash, the binary.errlog contained a lot of corrupted data (1010...).
Alexey Borchev
Regular Advisor

Re: Server crash

As far as I know, logging to binary.errlog is normally done by CPU 0.
So when any other CPU crashes, binary.errlog typically does have the errors logged. When CPU 0 crashes, it often does not.
That is exactly what we've seen.

The fire follows schedule...
Ralf Puchner
Honored Contributor

Re: Server crash

I've often seen CPU errors logged in binary.errlog, especially when it was the only CPU in the system.

A "too many processor corrected errors" message indicates that the CPU is trying to correct the problem, so logging to binary.errlog must be taking place. Maybe the second sentence, "Reporting suspended", gives us a clue that reporting stopped because of the errors.
Help() { FirstReadManual(urgently); Go_to_it;; }
Ramesh.K.R.
Regular Advisor

Re: Server crash

Hi Narayan & others,

I am also facing the same problem on one of our Alpha servers.
At boot time, the console shows the following error:
"Too many processor corrected errors detected on cpu (8, 16 & 24 -- the machine has 4 CPUs). Reporting suspended"

Does this indicate that all 3 of those CPUs are gone?

What is the best course of action for me?

Thanks & Regards,
Ramesh.K.R.
hai
Ralf Puchner
Honored Contributor

Re: Server crash

Open a case with the HP support center!
It looks like a CPU/memory/cache problem!
Help() { FirstReadManual(urgently); Go_to_it;; }
Mobeen_1
Esteemed Contributor

Re: Server crash

Guys,
I bet you have decevent on your Alphas. Why don't you try running it to look for the precise errors?

In any case, the error "Too many processor corrected errors on CPU0" suggests that you may need to have CPU0 replaced.

Are your Alphas on h/w support from HP? If they are, please log a call with them and have them replace CPU0. If they don't want to replace CPU0, I would suggest that you force a crash dump (from the boot prompt issue Ctrl+P) and pass that on to HP for review.

Hope this helps

Keep us updated.

regards
Mobeen
Ramesh.K.R.
Regular Advisor

Re: Server crash

Hi All,

Many thanks for the quick response. I will definitely log a call with HP support. What I wanted to know in the meantime is this: the error is repeated for CPU numbers 8, 16 & 24. So does it mean all 3 CPUs have a problem, or only CPU "0"?

Regards,
Ramesh.K.R.
hai
Ralf Puchner
Honored Contributor

Re: Server crash

Ramesh,

Time will tell ;-). The binary.errlog is the first step in analyzing the problem, but it is useless to do that without the programs and register information HP staff have.

This problem has been written about several times here in the forum, so please read those threads first and go through the HP support center. There is nothing we can do here at this point!
Help() { FirstReadManual(urgently); Go_to_it;; }
Ramesh.K.R.
Regular Advisor

Re: Server crash

OK, thanks everyone.
I will update this forum once I have any further info on this.

Regards,
Ramesh.K.R.
hai
Karthik S S
Honored Contributor

Re: Server crash

Ramesh, any updates on this? Jagga is having a similar problem. Refer to:

http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=507209

-Karthik S S
For a list of all the ways technology has failed to improve the quality of life, please press three. - Alice Kahn
Mobeen_1
Esteemed Contributor

Re: Server crash

Ramesh,
Were your CPUs replaced as we guessed? Let us know the outcome of your call with HP.

rgds
Mobeen