How does HP/Tandem NonStop achieve single failure FT without spares?

donald_78 · ‎08-03-2021

As far as I could gather from Wikipedia and the mind-boggling HPE website, the NonStop system architecture's claim to fame is that it achieves single-failure fault tolerance (FT) without allocating excessive amounts of spare capacity (e.g. in a lockstepped architecture you would typically need to overprovision by 3x).

This seems like a desirable property, yet I couldn't find more details about the approach they use or its caveats. For example: what assumptions do they make about the network, what kinds of failures do they tolerate, what client behavior do they assume, what is the acceptable time to recover, what workloads do they run, etc.

Could anybody briefly describe how the NonStop system solves the typical problems of failure detection and failure correction? Is it a generic magical solution at the system level, or does it require that applications be written to use certain transaction facilities and to checkpoint data and communications?

Thanks a lot!

techin · ‎08-03-2021

I was able to find some technical papers. Not sure if they'll help.

https://www.hpe.com/us/en/pdfViewer.html?docId=4aa6-5326&parentPage=/us/en/products/servers/mission-critical-servers/integrity-nonstop-systems&resourceTitle=Delivering+business+continuity+for+vital+applications+best+practices+guide

https://www.hpe.com/us/en/servers/nonstop.html

sradomsky · ‎08-11-2021

At a very high level, NonStop servers use all of their components all of the time. There are no idle spares; instead, software "fails over" the work of a failed component to a "Backup" component. All the parts are in use all the time, but you must build in a small amount of excess capacity to accommodate failover when a failure occurs. This concept is used for processors, controllers, disks, network adapters, and the system bus. Components generate "Alive" messages, which software monitors to enable fast failover in the case of a failure. This is made possible by the message-based operating system, where the OS can redirect messages based on the current state of any component. It is a little more complicated for software components, called processes, which can either fail over (if coded as NonStop) or be recreated quickly after a failure (if context free).
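
To make the Alive-message idea a bit more concrete, here is a minimal sketch in Python. It is not NonStop code and does not use any NonStop API: the names `Component`, `Router`, `alive`, `route`, and the `timeout` value are all made up for illustration, and timestamps are passed in explicitly so the behavior is easy to follow. It only shows the general pattern of "monitor alive messages, redirect messages to the backup when they stop".

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical component (e.g. a processor or controller)."""
    name: str
    last_alive: float = 0.0  # timestamp of the most recent alive message


class Router:
    """Hypothetical message router, NOT a NonStop API.

    Delivers messages to the primary while its alive messages keep
    arriving, and redirects them to the backup once they stop.
    """

    def __init__(self, primary: Component, backup: Component, timeout: float):
        self.primary = primary
        self.backup = backup
        self.timeout = timeout  # assumed value: silence longer than this means "failed"

    def alive(self, component: Component, now: float) -> None:
        # Record an alive message received from a component.
        component.last_alive = now

    def is_up(self, component: Component, now: float) -> bool:
        return now - component.last_alive <= self.timeout

    def route(self, message: str, now: float) -> str:
        # Fail over: if the primary went silent, the backup takes its place.
        if not self.is_up(self.primary, now) and self.is_up(self.backup, now):
            self.primary, self.backup = self.backup, self.primary
        if not self.is_up(self.primary, now):
            # Both components are silent: beyond single-failure tolerance.
            raise RuntimeError("double failure: no component available")
        return self.primary.name  # name of the component that handles the message


cpu0, cpu1 = Component("cpu0"), Component("cpu1")
router = Router(cpu0, cpu1, timeout=2.0)
router.alive(cpu0, now=0.0)
router.alive(cpu1, now=0.0)
before = router.route("debit account", now=1.0)  # both up, primary handles it
router.alive(cpu1, now=3.0)                      # cpu0 has gone silent
after = router.route("debit account", now=4.0)   # redirected to the backup
```

In this sketch the backup only receives redirected messages, whereas on a real system the "backup" is also busy with its own work. That is why only a small amount of excess capacity is needed rather than a full idle spare.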
