CPU failure

 
Wim Van den Wyngaert
Honored Contributor

CPU failure

I had a machine check indicating a cpu failure.

I would like to know how likely it is that the CPU will never fail again after this.

I have had some CPU failures before, and certain CPUs never gave the failure again. But maybe I was just very lucky.

Wim
9 REPLIES
Hein van den Heuvel
Honored Contributor

Re: CPU failure


This is NOT an official answer by any stretch, but from a practical point of view we find you can indeed often get away with a spurious CPU failure.

Of course I just work with lab/test equipment, not on production systems so I can just try and try again.

Sometimes we just reboot, and if we have the chance we will power down, re-seat modules, and power up. In doing so the problem has often gone away forever (or for long enough that we no longer remember an earlier failure).

Of course we also have a 'three strikes and you are out' rule. One failure... bad luck, reboot. Two failures... hmm, let's try the power-cycle + jiggle routine. Three failures... on the trash pile.
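
The rule above can be written down as a tiny lookup, purely as an illustration. This is a minimal sketch: the function name `next_action` and the action strings are made up for the example, and the thresholds simply mirror the informal lab rule described here, not any official guidance.

```python
def next_action(failure_count: int) -> str:
    """Map the number of failures seen on one CPU to the informal action.

    Hypothetical helper: the name and return strings are illustrative only.
    """
    if failure_count < 0:
        raise ValueError("failure_count must be non-negative")
    if failure_count == 0:
        return "none"                      # nothing has happened yet
    if failure_count == 1:
        return "reboot"                    # first strike: bad luck
    if failure_count == 2:
        return "power-cycle and re-seat"   # second strike: the jiggle routine
    return "replace"                       # third strike or more: out


if __name__ == "__main__":
    for count in range(4):
        print(count, "->", next_action(count))
```

A production site would likely move the "replace" threshold down to the first or second failure.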

Grins,
Hein.

Wim Van den Wyngaert
Honored Contributor

Re: CPU failure

Hein,

I was thinking of doing a replacement after a 2nd failure. So you agree, although not officially.
Wim
Craig A Berry
Honored Contributor

Re: CPU failure

Obviously keep an eye on the error log, and if your system supports it you may want to monitor possibly relevant details with the POWER_VECTOR, FAN_VECTOR, TEMPERATURE_VECTOR, and THERMAL_VECTOR item codes to $GETSYI. I believe WEBES and probably other management tools will keep track of these for you. These may or may not tell you anything, but it would be a shame to replace a CPU when the real problem was a temperature or power glitch.
Mike Naime
Honored Contributor

Re: CPU failure

Wim:

This is more a question of:
a.) How important is keeping that system up?
b.) How much does the downtime cost you?
c.) Do you have a support contract that covers the replacement, or does it come out of your pocket?

If it is a test/development machine that you really do not care about... So what if it craps out on you once a month or so. A power cycle will often clear a problem on a CPU that was looping bugchecks.

If it is a 24x7x365 system... can you take that chance?

It really is an unknown.

I'm changing out a CPU on one of our production boxes later tonight that caused a crash last Saturday. I have Platinum support on production systems, and our management pushes for the replacement even though HP recommends waiting for a second outage. To us, it is not worth the second outage. We have it replaced from the onsite spares!

Mike Naime
VMS SAN mechanic
Wim Van den Wyngaert
Honored Contributor

Re: CPU failure

Checked the details :

Webes :
Bcache tag parity error reported by CPU1, CPU Slot1 of SoftQbb0 (HardQbb0)

I found a similar case in which they say that it might be an overheated CPU. So, a one-time event?

http://groups.google.com/groups?num=100&hl=nl&lr=&ie=UTF-8&q=%22analyzing+hw+error+on+21164LX%22

In any case, replacing the CPU is also a risk: the new one may have problems too. In 2 years, we have replaced 2 out of 4 CPUs in the GS160. So a new CPU can be a bigger risk than keeping the old one.
Wim
Matt West
Advisor

Re: CPU failure

In my experience you should look to exchange suspect hardware at your earliest convenience. Some CPU routines can be overcome by simply upgrading your base firmware, as the routine may not have been strictly hardware related. That said, once a CPU displays a hard error it is likely to recur when you least want it to. My advice is to arrange downtime and swap this unit out by contacting the support center.

Mohamed K Ahmed
Trusted Contributor

Re: CPU failure

If you have multiple CPUs, one good practice is to swap the faulty one with one that is not causing errors, like CPU0 <--> CPU1. Then see whether the same errors appear on the faulty one again or move to the good one.
This way you will have done 2 things at the same time:
Re-seated the boards
Pointed to the exact faulty module
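
The reasoning behind the swap test can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name, the slot labels, and the result strings are all hypothetical, and real error logs may be far less clear-cut than a single "slot that reported the error".

```python
from typing import Optional


def diagnose_after_swap(suspect_new_slot: str,
                        suspect_old_slot: str,
                        error_slot: Optional[str]) -> str:
    """Interpret where errors show up after swapping two CPU modules.

    suspect_new_slot: slot the suspect module sits in after the swap
    suspect_old_slot: slot it was in before (now holding the good module)
    error_slot: slot reporting errors after the swap, or None if none seen
    All names and return strings are hypothetical, for illustration only.
    """
    if error_slot is None:
        # Nothing recurred: re-seating may have cleared it, or it was transient.
        return "no recurrence; keep monitoring"
    if error_slot == suspect_new_slot:
        # The error followed the module to its new slot.
        return "suspect module is faulty"
    if error_slot == suspect_old_slot:
        # The error stayed with the slot even with a good module in it.
        return "slot or backplane is suspect"
    return "unrelated slot; investigate separately"


if __name__ == "__main__":
    # Suspect module moved from slot 1 to slot 0; errors now reported in slot 0.
    print(diagnose_after_swap("CPU0", "CPU1", "CPU0"))
```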


Mohamed
Wim Van den Wyngaert
Honored Contributor

Re: CPU failure

This is 24/24. No swapping possible. And no other CPUs available.

The system has been up for 2 days now without any problems.

Wim
Lawrence Czlapinski
Trusted Contributor

Re: CPU failure

We had a production CPU bugcheck 21 days ago on a 2-CPU Alpha DS20 (500 MHz). We also run 24x7, so unless it happens again we will leave it alone. In our case, the CPU came back up by itself.