NonStop Servers

How does HP/Tandem NonStop achieve single failure FT without spares?

 
donald_78
Occasional Visitor

As far as I could gather from Wikipedia and the mind-boggling HPE website, the NonStop architecture's claim to fame is that it achieves single-failure fault tolerance (FT) without allocating excessive amounts of spare capacity (e.g. in a lockstepped architecture you would typically need to overprovision by 3x).

This seems a desirable property, yet I couldn't find more details about the approach they use or its caveats: what assumptions they make about the network, the kinds of failures they tolerate, the assumed client behavior, the acceptable time to recover, the workflows they run, etc.

Could anybody briefly describe how the NonStop system solves the typical problems of failure detection and failure correction? Is it a generic, magical solution at the system level, or does it require applications to be written to use certain transaction facilities and to checkpoint data and communications?

Thanks a lot!

 

techin
Super Advisor
sradomsky
Advisor

Re: How does HP/Tandem NonStop achieve single failure FT without spares?

At a very high level, NonStop servers use all the components all the time, without spares, by employing software that will fail over a failed component to a backup component. Because every part carries load, you must build in a small amount of excess capacity to accommodate failover in the case of a failure. This concept is used for processors, controllers, disks, network adapters, and the system bus. Components generate "alive" messages, which software monitors to enable fast failover when a component fails. This is accomplished by a message-based operating system, in which the OS can redirect messages based on the current state of any component. It is a little more complicated for software components, called processes, which can also fail over (if coded as NonStop) or be recreated quickly in the case of a failure (context-free).
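
To make the alive-message and redirection idea a bit more concrete, here is a minimal sketch in Python. It is not NonStop code and does not use any NonStop API: the names Component, MessageRouter, and timeout are hypothetical, time is passed in explicitly to keep the example deterministic, and only a single failure is assumed (the backup must itself still be reporting alive).

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical component (processor, controller, disk path, ...)."""
    name: str
    last_alive: float = 0.0                      # time of the latest alive message
    handled: list = field(default_factory=list)  # messages this component served

    def handle(self, message: str) -> None:
        self.handled.append(message)


class MessageRouter:
    """Delivers messages to the primary, or to the backup once the primary goes quiet."""

    def __init__(self, primary: Component, backup: Component, timeout: float):
        self.primary = primary
        self.backup = backup
        self.timeout = timeout  # assumption: longer silence than this means failure

    def alive(self, component: Component, now: float) -> None:
        """Record an alive message from a component."""
        component.last_alive = now

    def _is_alive(self, component: Component, now: float) -> bool:
        return now - component.last_alive <= self.timeout

    def check(self, now: float) -> bool:
        """Swap roles if the primary is silent and the backup is alive."""
        if not self._is_alive(self.primary, now) and self._is_alive(self.backup, now):
            self.primary, self.backup = self.backup, self.primary
            return True
        return False

    def send(self, message: str, now: float) -> str:
        """Route a message based on current component state; return who served it."""
        self.check(now)
        self.primary.handle(message)
        return self.primary.name


cpu0, cpu1 = Component("cpu0"), Component("cpu1")
router = MessageRouter(cpu0, cpu1, timeout=2.0)
router.alive(cpu0, now=0.0)
router.alive(cpu1, now=0.0)

first = router.send("req-1", now=1.0)   # cpu0 is still within its deadline
router.alive(cpu1, now=3.0)             # only cpu1 keeps reporting
second = router.send("req-2", now=3.5)  # cpu0 has been silent too long
print(first, second)
```

Note that both components do useful work before the failure; the sender never addresses cpu1 directly, the router redirects on its behalf. What this sketch leaves out is state: a process coded as NonStop would also have to get its working state across to the backup, which is the "more complicated" part mentioned above.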