HPE 9000 and HPE e3000 Servers

system on line with a cpu failure

Kirk Reindl
Frequent Advisor


I had a couple of general questions.

Let's say I'm running an rp7000 series server with HP-UX 11.11 on it, and it has 4 CPUs. If any one of the CPUs fails, will the server continue to run without crashing?

I know the "old school" T520s will crash and de-configure the bad CPU. I wasn't sure if the newer generation servers worked differently.

Also, why does the server need to crash in the event of a CPU failure? Why can't it continue to work on the remaining CPUs? Is it because applications are multi-threaded?

Is there any OS/hardware that can continue to run if a CPU fails?

Thanks
Kirk
4 REPLIES
Steven E. Protter
Exalted Contributor

Re: system on line with a cpu failure

Yes it will continue to run. You will notice a performance hit. On a 4 CPU system you might not notice unless your load factor is high.

If you run top, you will see one fewer CPU. The same goes for glance.

EMS will not detect the problem and email you. This is the kind of thing you have to spot as a sysadmin.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Jeff Schussele
Honored Contributor

Re: system on line with a cpu failure

Hi Kirk,

Most, if not all, of the time the system will panic. This is deliberate because the system may not be able to determine the resources that the failed CPU had allocated & will now never release. In fact this is what directly causes the panic. The timer pops on another CPU because it times out waiting for the resource that the failed CPU is *never* going to release.
I suppose that if a completely idle CPU failed, then there would be no need for a panic - but I suspect this happens rarely. If it's doing nothing, why would it fail?
Plus, the reboot gives the system the opportunity to deallocate the CPU.
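
The timeout mechanism described above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not how the HP-UX kernel is implemented: threads stand in for CPUs, a lock stands in for a kernel resource, and the names `SimulatedPanic`, `acquire_or_panic`, and `simulate` are hypothetical.

```python
import threading


class SimulatedPanic(Exception):
    """Stands in for a kernel panic in this toy model (hypothetical name)."""


def acquire_or_panic(lock, timeout_s):
    """Wait for the resource; 'panic' if the holder never releases it."""
    if not lock.acquire(timeout=timeout_s):
        raise SimulatedPanic("timed out waiting for a resource held by another CPU")


def simulate(holder_fails, timeout_s=0.2):
    """Model two CPUs sharing one resource; return 'ok' or 'panic'."""
    resource = threading.Lock()

    def first_cpu():
        resource.acquire()
        if not holder_fails:
            resource.release()
        # A "failed" CPU simply stops here and never releases the resource.

    cpu = threading.Thread(target=first_cpu)
    cpu.start()
    cpu.join()

    # The second CPU now needs the same resource.
    try:
        acquire_or_panic(resource, timeout_s)
    except SimulatedPanic:
        return "panic"
    resource.release()
    return "ok"


if __name__ == "__main__":
    print(simulate(holder_fails=False))
    print(simulate(holder_fails=True))
```

In this model the healthy CPU has no way to tell a slow holder from a dead one, so the only safe response once the timer expires is to stop.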

My 2 cents,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Pramod_4
Trusted Contributor

Re: system on line with a cpu failure

Kirk,

On HP servers with PA-8500 or later processors, dynamic processor deallocation is possible. All processors except CPU0 (known as the monarch processor) can be deallocated online when the HP EMS and Support Tools Manager products are installed.

Once the defective processor is deallocated online, it will be deconfigured completely upon next system reboot.

There is a white paper on this at http://docs.hp.com/hpux/onlinedocs/diag/dynamic.pdf

Hope this helps.

Pramod




Bill Hassell
Honored Contributor
Solution

Re: system on line with a cpu failure

There really isn't such a thing as a simple CPU failure, which is why it is almost impossible to provide a fail-soft condition for a multi-CPU system. A transistor fails in the instruction cache and the processor starts making mistakes. The kernel tasks are shared among the various processors, so you can imagine the disaster that occurs when the kernel goes wacky. The Monarch processor distributes independent tasks, but a failure in a CPU board seldom occurs in a way that allows a graceful deallocation to take place. Now, if kernel tasks were not distributed, perhaps a processor failure would simply crash the application running on the bad processor... still not a good thing for databases.

To answer your question, there are some specialized, extremely expensive hardware/OS products that can survive the majority of processor/memory/backplane failures, but for most data centers it is far more economical to run failover clusters.


Bill Hassell, sysadmin