How does HP/Tandem NonStop achieve single failure FT without spares?
04-20-2022 02:57 AM - last edited on 04-21-2022 10:11 AM by Parvez_Admin
As far as I could gather from Wikipedia and the mind-boggling HPE website, the claim to fame of the NonStop system architecture is that it can achieve single-failure fault tolerance (FT) without having to allocate excessive amounts of spare capacity (e.g. in a lockstepped architecture you would typically need to overprovision by 3x).
This seems a desirable property, yet I couldn't find more details about the approach they use and its caveats, e.g. what assumptions they make about the network, the kinds of failures they tolerate, the assumed client behavior, the acceptable time to recover, the workloads they run, etc.
Could anybody briefly describe how the NonStop system solves the typical problems of failure detection and failure correction? Is it a generic, magical solution at the system level, or does it require that applications be written to use certain transaction facilities and to checkpoint data and communications?
Thanks a lot!
04-20-2022 08:28 AM
Re: How does HP/Tandem NonStop achieve single failure FT without spares?
04-21-2022 08:08 AM
Re: How does HP/Tandem NonStop achieve single failure FT without spares?
A very interesting question. NonStop was developed using a massively parallel architecture rather than a cluster, which is what is usually done for availability. So NonStop is more of an N+1 architecture. It was also decided that NonStop would not have spare parts that sit idle until a failure. We wanted all parts of the system active, so that during a failure we would be migrating work to a known good part, if you will.
So in practice, how does that work? A NonStop system can have anywhere from 2 to 16 servers within what we call a 'system'. We can expand beyond 16, but that's another topic. Let's say you have a 4-server system. I should quickly say that a server, or what we refer to as a logical CPU, is currently a DL360 running either 2, 4 or 6 cores. So in my example we have 4 of these running as a single NonStop system. When one of the DL360s fails, our cool Operating System automatically (magically, if you will) shifts the workloads that were running on the failed DL360 to the other 3. All this happens without users even knowing anything failed. Some very cool Operating System intellectual property.
In practice this means that you should not run any individual processor in a 4-server system at more than 75% busy, so that the 75% workload that was running on the failed DL360 can fit on the remaining 3 DL360s that are still running (each survivor picks up an extra 25%). Hopefully that makes sense. Ping me if it's not clear. So NonStop needs 'headroom' to accommodate a failed load: if you have a 2-processor system, don't run more than a 50% CPU busy rate on either DL360. If you have a 16-server system you can run them pretty hot.
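
The headroom rule described above can be worked through in a few lines of Python. This is only an illustrative calculation, not anything from the NonStop OS: it assumes every logical CPU runs at the same utilization and that a failed CPU's load is spread evenly over the survivors, and the function names are hypothetical.

```python
def max_safe_utilization(n_cpus: int) -> float:
    """Highest per-CPU busy fraction that still survives one CPU failure.

    Assumption (not from HPE documentation): load is balanced, and the
    failed CPU's work is split evenly across the n_cpus - 1 survivors.
    """
    if n_cpus < 2:
        raise ValueError("need at least 2 CPUs to absorb a single failure")
    return (n_cpus - 1) / n_cpus


def load_after_failure(n_cpus: int, utilization: float) -> float:
    """Per-CPU busy fraction on each survivor after one CPU fails."""
    if n_cpus < 2:
        raise ValueError("need at least 2 CPUs to absorb a single failure")
    # Each survivor keeps its own load and takes an equal share of the failed one.
    return utilization + utilization / (n_cpus - 1)


if __name__ == "__main__":
    for n in (2, 4, 16):
        limit = max_safe_utilization(n)
        print(f"{n:2d} CPUs: keep each below {limit:.2%} busy")
```

Under these assumptions the limit is (N-1)/N, which matches the figures in the post: 50% for 2 processors and 75% for 4, and it rises toward 100% as the system grows to 16.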
04-21-2022 08:21 AM
Solution
I'll extend my answer a bit since you did ask 'how' NonStop knows something failed. The NonStop operating system is a very gossipy OS. In my last example of 4 DL360s running a NonStop 'system', it is viewed and managed as a single system (it appears to be one NonStop), but there are 4 individual servers, each running the NonStop OS independently of the others. Nothing is shared: disk and memory are independent on each server. We do have a dual shared fabric, currently InfiniBand. That's 2 fabrics, we use both, and they are designed so that either one can support the full communication load. These individual servers ping each other and themselves about every second. If one of the DL360s in the mix is not heard from in 2 cycles (2 seconds), it is declared down by the other 3 and its workload is shifted to the survivors. There's a lot to NonStop, but hopefully this answers the general questions you had.
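
The detection rule described above (a ping roughly every second, declared down after 2 silent cycles) can be sketched as a simple timeout-based failure detector. This is a toy model and not the NonStop implementation: the class name, method names, and the use of caller-supplied timestamps are all assumptions made for illustration, and it ignores the dual fabric and the workload migration that follows.

```python
from dataclasses import dataclass, field

HEARTBEAT_INTERVAL_S = 1.0   # "about every second" in the post
MISSED_CYCLES_LIMIT = 2      # "not heard from in 2 cycles"


@dataclass
class FailureDetector:
    """Hypothetical per-server view of its peers, driven by explicit timestamps."""
    peers: list[str]
    last_heard: dict[str, float] = field(default_factory=dict)
    down: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        # Treat every peer as having been heard from at time 0.
        for peer in self.peers:
            self.last_heard.setdefault(peer, 0.0)

    def record_heartbeat(self, peer: str, now: float) -> None:
        """Note that a ping from `peer` arrived at time `now` (seconds)."""
        if peer not in self.down:
            self.last_heard[peer] = now

    def check(self, now: float) -> set[str]:
        """Return peers newly declared down at time `now`."""
        timeout = HEARTBEAT_INTERVAL_S * MISSED_CYCLES_LIMIT
        newly_down = {
            peer
            for peer, heard in self.last_heard.items()
            if peer not in self.down and now - heard >= timeout
        }
        self.down |= newly_down
        return newly_down


if __name__ == "__main__":
    detector = FailureDetector(["cpu1", "cpu2", "cpu3"])
    for t in (1.0, 2.0, 3.0):
        detector.record_heartbeat("cpu1", t)
        detector.record_heartbeat("cpu2", t)
        if t == 1.0:
            detector.record_heartbeat("cpu3", t)  # cpu3 goes silent after t=1
        print(t, detector.check(t))
```

Passing timestamps in explicitly, rather than reading a clock, keeps the sketch deterministic; a real detector would also have to deal with a peer that is merely slow rather than dead, which this model does not attempt.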
04-25-2022 10:03 AM
Re: How does HP/Tandem NonStop achieve single failure FT without spares?
Glad to know that the issue has been resolved. Thank you.
Thanks,
Parvez_Admin
I work for HPE
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]