Failed CPUs, nPars and vPars

Eric Yruegas · ‎06-01-2009

General question for you all before I potentially stick my foot in my mouth and raise a stink over an issue I'm seeing.

I've got about a dozen rx8640s, all running HP-UX 11i v2, and all are running vPars - anywhere from 4 to 8 vPars per rx8640. Nothing out of the ordinary there.

Lately I've had a string of failed CPUs, and each time it's caused an HPMC and the whole nPar gets reset, which of course pulls the rug out from under all the running vPars. Ugh.

Unless I'm way off base - servers such as the 8640 are built with lots of fault-tolerance. They can 'handle' a failed CPU, DIMM, etc., without too much fuss. Or at least that's how it's supposed to work. I don't recall my PA-RISC machines rebooting upon detecting a failed CPU in the past. Even the technician who's been replacing the CPUs has told me that a failed CPU shouldn't be causing the whole server to reset.

Has anyone else seen this behavior? We're running vPars A.04.04.04, for what it's worth.

I had another failure over the weekend, and I'm waiting on Support to go over the chassis logs. If it's another CPU I'm going to escalate... I really don't think this should be happening.

Am I way off base in my thinking? Feedback appreciated.

Duncan Edmonstone · ‎06-01-2009

Eric,

Sorry, but you are wrong here...

We absolutely had HPMCs on PA-RISC systems, and they would always result in a reboot of the whole nPar. HP-UX boxes have long had something called "Dynamic Processor Resilience" (DPR), which can detect a "failing" CPU and take it offline, but you have to "detect" that failure first... some CPU failures can be caught and flagged (from which we can do a DPR deactivation), and some can't - seems to me like you've just been getting unlucky.
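To make the distinction concrete, here is a toy model in Python. It is only a sketch of the idea, not how HP-UX or the firmware actually implements DPR: the class name, the return strings and the threshold of 3 are all made up for illustration. Errors that get caught and flagged are counted per CPU and eventually lead to that one CPU being taken offline; an error that can't be caught is modelled as an HPMC that resets the whole partition.

```python
from dataclasses import dataclass, field

# Hypothetical threshold, chosen only for illustration.
CORRECTABLE_THRESHOLD = 3


@dataclass
class ToyPartition:
    """Toy stand-in for an nPar; not a model of real firmware behaviour."""
    cpus: set
    error_counts: dict = field(default_factory=dict)
    resets: int = 0

    def report_error(self, cpu, correctable):
        # Errors from CPUs that are already offline are ignored.
        if cpu not in self.cpus:
            return "ignored"
        # An error that cannot be caught takes down the whole partition.
        if not correctable:
            self.resets += 1
            self.error_counts.clear()
            return "hpmc_reset"
        # A caught error is counted against the CPU that raised it.
        count = self.error_counts.get(cpu, 0) + 1
        self.error_counts[cpu] = count
        if count >= CORRECTABLE_THRESHOLD:
            # Enough flagged errors: take just this CPU offline.
            self.cpus.discard(cpu)
            return "cpu_deactivated"
        return "logged"
```

In this sketch, three flagged errors on CPU 0 remove only CPU 0 and everything else keeps running, while a single uncatchable error on any CPU increments `resets`, which is the case that takes every vPar in the nPar down with it.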

The problem is that some folks read the "DPR" material and interpret it as "HP UNIX servers can withstand CPU failures *under all conditions*", and that isn't and won't be the case. The same is true for Sun UNIX systems, IBM UNIX systems, in fact pretty much every "open system" that's out there in the market. There are exceptions like the HP NonStop server, but they get round the issue by "lock stepping" the same instructions through 2 CPUs for resiliency.

And if the technician who's swapping the CPUs (presumably an HP employee) thinks that HP UNIX servers can catch all CPU failures without a reboot, he needs to speak with some of the internal labs folks to get that straight.

HTH

Duncan

I am an HPE Employee

Eric Yruegas · ‎06-01-2009

Perhaps it's just a string of bad luck in my datacenter then... pretty frustrating though to see what may be our fourth failed Integrity CPU in a two-week time span!

It's both our Montvales and Montecitos - so I can't really blame a particular CPU model.

Guess I'll just grin and bear it. And tell my management to do the same. :-)

Thanks!
