Fault tolerance

Trusted Contributor

I have been asked to analyse our HP servers for fault tolerance.

We have mirrored root disks, RAID arrays, and network redundancy, but I am wondering what other possible faults could be a single point of failure, and what tools there may be to provide redundancy.

All advice much appreciated.
Thanks
7 REPLIES

Re: Fault tolerance

Well, there are other SPOFs, such as the application, the systems themselves, and the networking infrastructure, just to name a few.
There are various products and methodologies to reduce failures and downtime, but be wary of using the term Fault Tolerant.

One product from HP is MC/ServiceGuard, which is a High Availability product. You can take a look at some info here:
http://h30046.www3.hp.com/solutions/solutionlist.php?topiccode=INFRAHAINDEX&regioncode=NA&langcode=USENG
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Trusted Contributor

Re: Fault tolerance

Thanks,

I am really looking for single points of failure at the moment, such as a SCSI adapter failing.

It is an N-Class server with 2 CPUs. If one fails, will the server still operate on the other one?

Re: Fault tolerance

No, the system will fail if one CPU hits a major hardware error. The result may be a simple panic followed by a clean reboot, or the system may be unable to boot at all until the failed part is replaced.
All of the other boards are generally SPOFs as well.
This is one of the major failure types that ServiceGuard (SG) is designed to cover in an HA cluster.
You may also want to read the manuals at:
http://docs.hp.com/hpux/ha and select ServiceGuard
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Honored Contributor

Re: Fault tolerance

Generally, a CPU failure is catastrophic and the system will crash, usually because the failure is due to a shared resource (memory cache). There are a limited number of failure modes that can take a processor out of service, but it would be quite difficult to quantify the effectiveness. HP-UX will run on fewer processors, but the hardware failure mode (LPMC versus HPMC) determines whether the primary CPU continues. NOTE: the monarch or primary CPU cannot fail while HP-UX is running.

Disk and LAN card failures are tricky to quantify too. In the simplest failure mode (detected and disabled), the OS can handle load balancing within limits, but if a failure occurs on the backplane side, it may disrupt all cards on that backplane or hang the system. So the nature of the failure has more to do with success than redundancy.

That's why a ServiceGuard environment makes more sense. A failure that takes the entire system down is handled automatically by a separate system through monitoring, something that a single system cannot do.


Bill Hassell, sysadmin
Honored Contributor

Re: Fault tolerance

Power supply: I bet they're all going into the same power bar on the same rack.

Don't forget disaster failover: are your backups at a remote site?
It works for me (tm)
Honored Contributor

Re: Fault tolerance

Hi

My first choice is HP's MC/ServiceGuard, and second is having local and remote backups.
never give up
Honored Contributor

Re: Fault tolerance

Look at power: how's the power fed to the panel, how's it fed to the PDUs, how's it fed to the equipment? Look at your UPS for failure modes. Got diesel?

Look at HVAC - do you have dual units, are they powered by the diesel?

If you have mirrored drives but you don't have multiple buses (e.g. the D-class boxen), a SCSI card failure will take down the box.
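
The checks suggested in this thread (buses, power, HVAC, the server itself) could be tracked with something as simple as the sketch below. It is only an illustration: the component names and counts are made-up assumptions, not data from the poster's N-Class, and `single_points_of_failure` is a hypothetical helper, not part of any HP tool.

```python
# Hypothetical inventory: component -> number of independent instances/paths.
# All names and counts here are illustrative assumptions.
def single_points_of_failure(inventory):
    """Return the sorted names of components with fewer than two independent instances."""
    return sorted(name for name, count in inventory.items() if count < 2)


example_inventory = {
    "root disk": 2,    # mirrored
    "SCSI bus": 1,     # both mirror halves on one bus/card
    "LAN card": 2,
    "power feed": 1,   # everything on the same power bar
    "HVAC unit": 1,
    "server": 1,       # no second node to fail over to
}

if __name__ == "__main__":
    for name in single_points_of_failure(example_inventory):
        print("SPOF:", name)
```

Anything listed with a count of 1 is a candidate for a second path (another bus, another power feed) or, for the server itself, a clustered failover node.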
