Operating System - HP-UX

Failed CPUs, nPars and vPars

 
Eric Yruegas
Frequent Advisor


General question for you all before I potentially stick my foot in my mouth and raise a stink over an issue I'm seeing.

I've got about a dozen rx8640s, all running HP-UX 11i v2, and all are running vPars - anywhere from 4 to 8 vPars per 8640. Nothing out of the ordinary there.

Lately I've had a string of failed CPUs, and each time it's caused an HPMC and the whole nPar gets reset, which of course pulls the rug out from under all the running vPars. Ugh.

Unless I'm way off base - servers such as the 8640 are built with lots of fault-tolerance. They can 'handle' a failed CPU, DIMM, etc., without too much fuss. Or at least that's how it's supposed to work. I don't recall my PA-RISC machines rebooting upon detecting a failed CPU in the past. Even the technician who's been replacing the CPUs has told me that a failed CPU shouldn't be causing the whole server to reset.

Has anyone else seen this behavior? We're running vPars A.04.04.04, for what it's worth.

I had another failure over the weekend, and I'm waiting on Support to go over the chassis logs. If it's another CPU I'm going to elevate... I really don't think this should be happening.

Am I way off base in my thinking? Feedback appreciated.

Re: Failed CPUs, nPars and vPars

Eric,

Sorry, but you are wrong here...

We absolutely had HPMCs on PA-RISC systems and they would absolutely always result in the rebooting of a whole nPar. All HP-UX boxes have always had something called "Dynamic Processor Resiliency", which can detect a "failing" CPU and take it offline, but you have to "detect" that failure... some CPU failures can be caught and flagged (from which we can do a DPR deactivation), and some can't - seems like you've been getting unlucky to me.

The problem is that some folks read the "DPR" material and interpret it as "HP UNIX servers can withstand CPU failures *under all conditions*", and that isn't and won't be the case. The same is true for Sun UNIX systems, IBM UNIX systems, in fact pretty much every "open system" that's out there in the market. There are exceptions like the HP NonStop server, but they get round the issue by "lock stepping" the same instructions through 2 CPUs for resiliency.

And if the technician who's swapping the CPUs (presumably an HP employee) thinks that HP UNIX servers can catch all CPU failures without a reboot, he needs to speak with some of the internal labs folks to get that straight.

HTH

Duncan

I am an HPE Employee
Eric Yruegas
Frequent Advisor

Re: Failed CPUs, nPars and vPars

Perhaps it's just a string of bad luck in my datacenter then... pretty frustrating though to see what may be our fourth failed Integrity CPU in a two-week time span!

It's both our Montvales and Montecitos - so I can't really blame a particular CPU.

Guess I'll just grin and bear it. And tell my management to do the same. :-)

Thanks!