
How does HP/Tandem NonStop achieve single failure FT without spares?

 
SOLVED
Hightower444
Occasional Contributor


As far as I could gather from Wikipedia and the mind-boggling HPE website, the claim to fame of the NonStop system architecture is that it achieves single-failure fault tolerance (FT) without allocating excessive amounts of spare capacity (e.g. a lockstepped architecture would typically need to be overprovisioned by 3x).

This seems a desirable property, yet I couldn't find more details about the approach they use and its caveats. For example, what assumptions do they make about the network, the kinds of failures they tolerate, client behavior, the acceptable time to recover, the workflows they run, etc.?

Could anybody briefly describe how the NonStop system solves the typical problems of failure detection and failure correction? Is it a generic magical solution at the system level, or does it require that applications be written to use certain transaction facilities and to checkpoint data and communications?

Thanks a lot!

techin
Valued Contributor

Re: How does HP/Tandem NonStop achieve single failure FT without spares?

jjsim
HPE Pro

Re: How does HP/Tandem NonStop achieve single failure FT without spares?

A very interesting question. NonStop was developed as a massively parallel architecture rather than a cluster, which is what is usually done for availability. So NonStop is more of an N+1 architecture. It was also decided that NonStop would not have spare parts sitting idle until a failure. We wanted all parts of the system active, so that during a failure we would be migrating work to a known good part, if you will.

So how does that work in practice? A NonStop 'system' can have anywhere from 2 to 16 servers. We can expand beyond 16, but that's another topic. A server, or what we refer to as a logical CPU, is currently a DL360 running 2, 4 or 6 cores. Let's say you have a 4-server system, so 4 of these DL360s running as a single NonStop system. If one of the DL360s fails, our cool operating system automatically (magically, if you will) shifts the workloads that were running on the failed DL360 to the other 3, all without users even knowing anything failed. Some very cool operating system intellectual property.

In practice this means that in a 4-server system you should not run any individual processor at more than 75% busy, so that the 75% workload that was running on the failed DL360 can fit on the 3 DL360s that are still running. If that load spreads evenly, each survivor picks up a third of it and ends up at about 100%. Hopefully that makes sense. Ping me if it's not clear. So NonStop needs 'headroom' to accommodate a failed load: on a 2-processor system, don't run either DL360 at more than a 50% CPU busy rate. If you have a 16-server system you can run them pretty hot.
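The headroom rule above comes down to simple arithmetic, and a short Python sketch may make it easier to check for other system sizes. This is only an illustration: the function names are made up for this example, and it assumes every processor carries the same load and that a failed processor's work is spread evenly over the survivors, which a real system will only approximate.

```python
def max_safe_utilization(n_cpus: int) -> float:
    """Highest per-CPU busy fraction that still leaves room for one failure.

    Assumption (not from HPE documentation): load is balanced, and the failed
    CPU's work is redistributed evenly across the n_cpus - 1 survivors.
    """
    if n_cpus < 2:
        raise ValueError("need at least 2 CPUs to survive a single failure")
    return (n_cpus - 1) / n_cpus


def utilization_after_failure(n_cpus: int, busy: float) -> float:
    """Per-CPU busy fraction on each survivor after one CPU fails."""
    if n_cpus < 2:
        raise ValueError("need at least 2 CPUs to survive a single failure")
    # Total work is n_cpus * busy; it now has to fit on n_cpus - 1 CPUs.
    return busy * n_cpus / (n_cpus - 1)


for n in (2, 4, 16):
    print(f"{n:2d} CPUs: run each at no more than {max_safe_utilization(n):.2%}")
```

For the sizes mentioned in the thread this gives 50% for 2 processors and 75% for 4, and the limit for 16 processors is well above 90%, which is why a large system can be run "pretty hot".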

jjsim
HPE Pro
Solution

Re: How does HP/Tandem NonStop achieve single failure FT without spares?

I'll extend my answer a bit, since you did ask 'how' NonStop knows something failed. The NonStop operating system is a very gossipy OS. In my last example of 4 DL360s running a NonStop 'system', it is viewed and managed as a single system (it appears to be one NonStop), but there are 4 individual servers running the NonStop OS, each independent of the others. Nothing is shared: disk and memory are independent on each server. We have a dual shared fabric, currently InfiniBand. That's 2 fabrics, we use both, and they are designed so that either one can support the full communication load. These individual servers ping each other and themselves about every second. If one of the DL360s in the mix is not heard from in 2 cycles (2 seconds), it is declared down by the other 3 and its workload is shifted to the survivors. There's a lot to NonStop, but hopefully this answers the general questions you had.
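As a rough illustration of the "not heard from in 2 cycles" rule described above, here is a minimal Python sketch of a heartbeat timeout check. It is not NonStop code: the `HeartbeatMonitor` class, its method names, and the use of explicit timestamps are all assumptions made for this example, and only the 1-second interval and the 2-cycle threshold come from the description in this reply.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical per-server view of when each peer was last heard from."""
    interval: float = 1.0      # seconds between pings (from the post)
    missed_cycles: int = 2     # cycles of silence before declaring a peer down
    last_heard: dict = field(default_factory=dict)

    def record_heartbeat(self, cpu: int, now: float) -> None:
        """Note that a ping from `cpu` arrived at time `now` (in seconds)."""
        self.last_heard[cpu] = now

    def down_cpus(self, now: float) -> list:
        """Return the CPUs that have been silent for the full timeout."""
        timeout = self.interval * self.missed_cycles
        return sorted(
            cpu for cpu, heard in self.last_heard.items()
            if now - heard >= timeout
        )


# Four CPUs all ping at t=0; CPU 3 then goes silent.
monitor = HeartbeatMonitor()
for cpu in range(4):
    monitor.record_heartbeat(cpu, now=0.0)
for t in (1.0, 2.0):
    for cpu in (0, 1, 2):
        monitor.record_heartbeat(cpu, now=t)

print(monitor.down_cpus(now=1.5))  # one missed cycle is not enough yet
print(monitor.down_cpus(now=2.0))  # two full cycles of silence
```

Passing timestamps in explicitly, rather than reading a clock inside the class, keeps the sketch deterministic and easy to test. In a real system each survivor would reach this decision independently and would then have to agree with its peers before taking over the failed server's work, which this sketch does not attempt to model.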

Parvez_Admin
Community Manager

Re: How does HP/Tandem NonStop achieve single failure FT without spares?

@Hightower444 ,

Glad to know that the issue has been resolved. Thank you


Thanks,
Parvez_Admin
I work for HPE
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]