How does HP/Tandem NonStop achieve single failure FT without spares?

Hightower444 · ‎04-20-2022

As far as I could gather from Wikipedia and the mind-boggling HPE website, the claim to fame of the NonStop system architecture is that it achieves single-failure fault tolerance (FT) without having to allocate excessive amounts of spare capacity (e.g. in a lockstepped architecture you would typically need to overprovision by 3x).

This seems a desirable property, yet I couldn't find more details about the approach they use and its caveats: for example, what assumptions they make about the network, the kinds of failures they tolerate, assumed client behavior, the acceptable time to recover, the workflows they run, etc.

Could anybody describe in brief how the NonStop system solves the typical problems of failure detection and failure correction? Is it a generic magical solution at the system level, or does it require that applications are written to use certain transaction facilities and to checkpoint data and communications?

Thanks a lot!

techin · ‎04-20-2022

Check these:

https://support.hpe.com/hpesc/public/docDisplay?docId=a00099017en_us&docLocale=en_US

https://support.hpe.com/hpesc/public/docDisplay?docId=a00045838en_us&docLocale=en_US

jjsim · ‎04-21-2022

A very interesting question. NonStop was developed as a massively parallel architecture rather than what is usually done for availability, a cluster. So NonStop is more of an N+1 architecture. It was also decided that NonStop would not have spare parts sitting idle until a failure. We wanted all parts of the system active, so that during a failure we would be migrating work to a known good part, if you will.

So in practice how does that work? A NonStop system can have anywhere from 2 to 16 servers within what we call a 'system'. We can expand beyond 16, but that's another topic. Let's say you have a 4-server system. I should quickly say that a server, or what we refer to as a logical CPU, is currently a DL360 running either 2, 4 or 6 cores. So in my example we have 4 of these running as a single NonStop system. If one of the DL360s fails, our cool operating system automatically (magically, if you will) shifts the workloads that were running on the failed DL360 to the other 3, all without users even knowing anything failed. Some very cool operating system intellectual property.

In practice this means that you should not run any individual processor in a 4-server system at more than 75% busy, so that the 75% workload that was running on the failed DL360 can fit in the remaining 3 DL360s that are still running. Hopefully that makes sense. Ping me if it's not clear. So NonStop needs 'headroom' to accommodate a failed load: if you have a 2-processor system, don't run more than a 50% CPU busy rate on either DL360. If you have a 16-server system you can run them pretty hot.
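
To make the headroom arithmetic concrete, here is a minimal Python sketch. It is only an illustration, not anything from the NonStop OS: it assumes the load is spread evenly across all servers and that the failed server's work is split evenly among the survivors, and the function names are made up for this example.

```python
from fractions import Fraction


def max_safe_utilization(n_servers: int) -> Fraction:
    """Highest per-server busy rate that still survives one failure.

    Assumption (not from HPE documentation): load is balanced evenly and
    the failed server's work is split evenly among the survivors.
    """
    if n_servers < 2:
        raise ValueError("need at least 2 servers to absorb a failure")
    # Each of the n-1 survivors must absorb 1/(n-1) of the failed load,
    # so u + u/(n-1) <= 1, which gives u <= (n-1)/n.
    return Fraction(n_servers - 1, n_servers)


def utilization_after_failure(n_servers: int, utilization: Fraction) -> Fraction:
    """Per-server busy rate on the survivors after one server fails."""
    if n_servers < 2:
        raise ValueError("need at least 2 servers to absorb a failure")
    return utilization * n_servers / (n_servers - 1)


if __name__ == "__main__":
    for n in (2, 4, 16):
        limit = max_safe_utilization(n)
        print(f"{n} servers: keep each below {float(limit):.2%} busy")
```

Under these assumptions the limit is (N-1)/N: 50% for 2 servers, 75% for 4, and 93.75% for 16, which matches the "run them pretty hot" remark for large systems.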

jjsim · ‎04-21-2022

I'll extend my answer a bit since you did ask 'how' NonStop knows something failed. The NonStop operating system is a very gossipy OS. In my last example of 4 DL360s running a NonStop 'system', it is viewed and managed as a single system (it appears to be one NonStop), but there are 4 individual servers, each running the NonStop OS independently of the others. Nothing is shared: disk and memory are independent on each. We have a dual shared fabric, currently InfiniBand; that's 2 fabrics, we use both, and they are designed so that either one can support the full communication load. These individual servers ping each other and themselves about every second. If one of the DL360s in the mix is not heard from in 2 cycles (2 seconds), it is declared down by the other 3 and its workload is shifted to the survivors. There's a lot to NonStop, but hopefully this answers the general questions you had.
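
The detection rule described above (a ping about every second, declared down after 2 missed cycles) can be sketched roughly as follows. This is a toy model in Python, not how the NonStop OS is actually implemented: the class name, method names, and the use of explicit timestamps instead of a real fabric are all assumptions made for illustration.

```python
class HeartbeatMonitor:
    """Toy timeout-based failure detector for a set of peer servers.

    Hypothetical illustration only: timestamps are passed in explicitly
    so the logic can be followed without real networking or clocks.
    """

    def __init__(self, peers, interval=1.0, missed_limit=2, start=0.0):
        # A peer is considered down once it has been silent for
        # interval * missed_limit seconds (2 seconds with the defaults).
        self.timeout = interval * missed_limit
        self.last_seen = {peer: start for peer in peers}

    def heartbeat(self, peer, now):
        """Record that a ping from `peer` arrived at time `now`."""
        self.last_seen[peer] = now

    def down_peers(self, now):
        """Return the peers not heard from within the timeout."""
        return sorted(
            peer
            for peer, seen in self.last_seen.items()
            if now - seen >= self.timeout
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(["cpu1", "cpu2", "cpu3"])
    for t in (1.0, 2.0, 3.0):
        monitor.heartbeat("cpu1", t)
        monitor.heartbeat("cpu2", t)
        if t == 1.0:
            monitor.heartbeat("cpu3", t)  # cpu3 goes silent after t=1.0
    print(monitor.down_peers(now=3.0))
```

In this sketch each surviving server would run its own monitor and then take over the failed server's work; how the survivors agree on the verdict and redistribute the load is not covered here.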

Parvez_Admin · ‎04-25-2022

@Hightower444 ,

Glad to know that the issue has been resolved. Thank you

Thanks,
Parvez_Admin
I work for HPE
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
