The unique set of RAS features in HPE Superdome Flex: How they work and why they matter

ComputeExperts · ‎05-17-2018

Learn how HPE Superdome Flex delivers five 9s system availability by incorporating extraordinary Reliability, Availability and Serviceability (RAS) capabilities that keep you up and running.

It’s no secret: downtime stopped being an option for the vast majority of industries many years ago. Increasingly, customers expect to get any product or service, anytime and anywhere. If your website is unresponsive, an application down, or a service unavailable, customers have plenty of other choices. So, no matter your business, when it comes to critical applications and databases, you’d better have a solid, reliable infrastructure behind them that minimizes the risk of downtime—and of your customers heading to the competition.

The HPE Superdome Flex story

In case you missed it, I recently wrote an in-depth post on the HPE Superdome Flex modular architecture and why it matters. It sets the foundation for the HPE Superdome Flex story I’m continuing to tell here.

In this second blog, I focus on the Reliability, Availability and Serviceability (RAS) capabilities of the HPE Superdome Flex. I’ll discuss how these capabilities set HPE Superdome Flex apart from other industry-standard servers and how they can help you stay up and running.

Keeping your workloads and data available even when faults occur: HPE’s Fault Management Strategy

When HPE sets out to design mission-critical servers, such as the HPE Superdome Flex, we follow a four-stage RAS strategy to (1) detect, (2) log, (3) analyze, and (4) repair any fault or error. This strategy keeps your workloads up and data available even in the presence of faults. Our design principles include minimizing repair time; diagnosing failures rapidly through capturing enough data; logging errors thoroughly so we can diagnose all the system components including software, firmware and hardware; and adding self-healing capabilities to the system.
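The four stages above can be sketched as a small pipeline. This is a minimal Python illustration of the detect, log, analyze, repair ordering only; it is not HPE firmware. The `Fault` and `FaultManager` names, the shape of the `reading` dictionary, and the "monitor" versus "deconfigure" policy are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Fault:
    component: str      # hypothetical component name, e.g. "dimm3"
    correctable: bool


@dataclass
class FaultManager:
    log: List[str] = field(default_factory=list)
    deconfigured: List[str] = field(default_factory=list)

    def detect(self, reading: dict) -> Optional[Fault]:
        # Stage 1: turn a raw reading into a Fault, or None if healthy.
        if reading.get("error"):
            return Fault(reading["component"], reading.get("correctable", False))
        return None

    def record(self, fault: Fault) -> None:
        # Stage 2: log enough detail to diagnose the fault later.
        kind = "correctable" if fault.correctable else "uncorrectable"
        self.log.append(f"{fault.component}: {kind}")

    def analyze(self, fault: Fault) -> str:
        # Stage 3: pick an action (illustrative policy, an assumption).
        return "monitor" if fault.correctable else "deconfigure"

    def repair(self, fault: Fault, action: str) -> None:
        # Stage 4: act on the analysis.
        if action == "deconfigure":
            self.deconfigured.append(fault.component)

    def handle(self, reading: dict) -> Optional[str]:
        # Run the four stages in order for one reading.
        fault = self.detect(reading)
        if fault is None:
            return None
        self.record(fault)
        action = self.analyze(fault)
        self.repair(fault, action)
        return action
```

Keeping each stage as its own method mirrors the point of the strategy: a fault is always logged before any decision is taken, so the evidence survives even if the repair step changes the system.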

It’s inescapable. In any IT environment, faults will occur. When it comes to infrastructure faults, the challenges you must address include:

How to establish what went wrong
How to prevent the fault reaching the higher levels of the IT stack, such as the operating system, database, application and, ultimately, the data
How to handle errors to minimize or eliminate unplanned and planned downtime

In the case of Superdome Flex, HPE has developed an extensive set of differentiated features to address these challenges. Here, I highlight just a few of the key features. For an extensive list and deeper-dive explanation of them all, please check out the HPE Superdome Flex Architecture and RAS Technical Whitepaper.

Now on to the details about what these RAS capabilities bring to your mission-critical server environment.

Establishing what went wrong: Error detection and system self-healing

HPE Superdome Flex delivers unique resiliency capabilities across every subsystem—memory, I/O, fabric and processor—to protect your critical workloads, detect errors and self-heal from a multitude of faults. HPE has instrumented the firmware and hardware inside the system at the lowest level to ensure that all the necessary evidence is collected to detect errors, as well as to determine root causes and correlations between errors.

Memory: Advanced memory resiliency (ADDDC): The industry standard for memory protection is single error correction and double error detection (SECDED) of data errors. Many servers on the market also provide single device data correction, also known as chip sparing or chipkill. Single device data correction protects the system from data errors within a single memory device, but it will generally not protect the system if a second DRAM has also failed. That second failure is detected, but it will still cause the system to crash.

HPE Superdome Flex servers address this problem with Adaptive Double Device Data Correction (ADDDC) technology. ADDDC determines when the first DRAM in a rank has failed, corrects the data, and maps that DRAM out of use by moving its data to spare bits in the rank. Once this is done, single device data correction is still available for the corrected rank. Thus, a total of two entire DRAMs in a rank of dual in-line memory modules (DIMMs) can fail and memory is still protected, which is paramount for memory-intensive workloads like the ones commonly deployed on Superdome Flex. After a second defective DRAM is detected, the OS will shut down to prevent any data corruption. ADDDC makes the system essentially tolerant of a DRAM failure on every DIMM.

Compared to systems with only single-chip sparing, ADDDC drastically improves system uptime, as fewer failed DIMMs need to be replaced, and significantly reduces the chances of memory-related crashes. Although ADDDC is based upon an Intel® Xeon® processor E7 feature, HPE has enhanced the Intel base code on HPE Superdome Flex with specific firmware and hardware algorithms.
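The difference between single device data correction and ADDDC can be modeled as a count of how many DRAM failures a rank absorbs before an error becomes uncorrectable. The Python sketch below is a simplified model under stated assumptions: the `Rank` class and its return strings are hypothetical, it tracks only correction capacity, and it does not model the OS shutdown policy or the real firmware algorithms.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Rank:
    adddc: bool  # False models single device data correction only
    spared: Set[int] = field(default_factory=set)     # devices mapped out to spare bits
    corrected: Set[int] = field(default_factory=set)  # failed devices covered by on-the-fly correction

    def device_failure(self, device: int) -> str:
        if device in self.spared or device in self.corrected:
            return "already handled"
        if self.adddc and not self.spared:
            # First failure with ADDDC: move the data to spare bits,
            # leaving single device correction available afterwards.
            self.spared.add(device)
            return "spared"
        if not self.corrected:
            # Single device correction absorbs one failed device.
            self.corrected.add(device)
            return "corrected"
        # No correction capacity is left in this rank.
        return "uncorrectable"
```

In this model a rank without ADDDC reports "uncorrectable" on its second failed device, while a rank with ADDDC reaches that state only on the third, which matches the claim that two entire DRAMs in a rank can fail while memory stays protected.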

I/O: Advanced PCIe error recovery (LER): Uncorrectable errors in a server’s PCIe subsystem can potentially propagate to other components, resulting in a crash. To minimize this risk in HPE Superdome Flex servers, HPE developed specific firmware leveraging Intel’s Live Error Recovery (LER) mechanism to provide a way of trapping errors at a root port, thereby preventing error propagation. LER containment allows the platform to detect PCIe errors in the inbound and outbound PCIe path. When a PCIe error occurs, LER contains the error by stopping I/O transfers, which prevents corrupted data from reaching the network and/or permanent storage. LER containment also avoids propagation of the error and an immediate crash of the machine. In parallel, HPE firmware is informed and, in turn, the OS and upper layer device drivers are made aware of the error. This innovative Superdome Flex solution for Live Error Recovery is not available on typical Xeon-based systems.
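The containment behavior described above can be pictured with a toy model of a root port. This is a hedged Python sketch, not the Intel LER register interface or HPE firmware: the `RootPort` class, the `notify` callback standing in for the firmware notification, and the `recover` method are hypothetical names introduced for illustration.

```python
from typing import Callable, List


class RootPort:
    def __init__(self, notify: Callable[[str], None]) -> None:
        self.contained = False
        self.delivered: List[bytes] = []
        self._notify = notify  # stands in for informing firmware, then OS and drivers

    def transfer(self, payload: bytes, error: bool = False) -> bool:
        if self.contained:
            # Port is in containment: drop I/O rather than pass it on.
            return False
        if error:
            # Trap the error at the root port and stop further transfers.
            self.contained = True
            self._notify("uncorrectable PCIe error contained at root port")
            return False
        self.delivered.append(payload)
        return True

    def recover(self) -> None:
        # Hypothetical recovery step once the error has been handled.
        self.contained = False
```

The property that matters is that nothing is delivered between the error and the recovery step, so corrupted data never reaches the network or storage while the rest of the machine keeps running.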

Fabric: Adaptive Routing: Another HPE Superdome Flex innovation, the interconnect scheme, provides adaptive routing that increases availability by dynamically routing traffic around a failed component, without requiring downtime. The system continues to run in the rare event of most fabric failures. This capability also boosts performance by routing traffic via the optimal latency path. Superdome Flex has been designed to achieve fault-tolerant fabric resiliency using high-bandwidth links providing multiple paths, and a packet-based transport layer that guarantees delivery of packets through the fabric.
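Routing around a failed link can be illustrated with a breadth-first search over a small fabric graph. The sketch below is an assumption-laden simplification: the `route` function, the four-node topology in the usage note, and the use of hop count as a stand-in for latency are all hypothetical, and the real interconnect makes these decisions in hardware.

```python
from collections import deque
from typing import AbstractSet, Dict, FrozenSet, List, Optional


def route(links: Dict[str, List[str]], src: str, dst: str,
          failed: AbstractSet[FrozenSet[str]] = frozenset()) -> Optional[List[str]]:
    """Return a fewest-hop path from src to dst that avoids failed links."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        for nxt in links.get(node, []):
            # Skip nodes already reached and links marked as failed.
            if nxt in seen or frozenset((node, nxt)) in failed:
                continue
            seen.add(nxt)
            queue.append(path + [nxt])
    return None  # no surviving path
```

With a fabric such as `{"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}`, marking the B to D link as failed makes the search return the path through C instead, with no change to the endpoints.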

Preventing faults from reaching the operating system and higher-level software: Firmware First

Firmware First is a key component of HPE’s comprehensive Superdome strategy for fault management. With Firmware First, firmware with detailed knowledge of the HPE Superdome Flex system is first on the scene when problems arise to quickly and accurately determine what’s wrong and how to fix it. Firmware First covers correctable and uncorrectable errors, and enables firmware to collect error data and diagnose faults, even when the system processors have limited functionality. This approach also enables many platform-specific actions in response to faults, including predictive fault analysis for system memory, CPU, I/O, and interconnect.

What makes it possible for firmware to play this role is the Enhanced Machine Check Architecture Gen 2 (EMCA2) present in the latest generation Intel Xeon Processor Scalable family, which HPE Superdome Flex uses. This architecture allows firmware a first look at error logs, so that the firmware itself can diagnose problems and take appropriate actions for the platform before the operating system and higher-level software are affected or involved. However, although the EMCA2 architecture is common to all platforms based on Intel Xeon Scalable processors, Firmware First is unique to HPE Superdome Flex and Superdome X servers.
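The ordering that Firmware First implies, with firmware seeing every error record before the operating system does, can be sketched in a few lines of Python. This is an illustration only: the `Firmware` and `ErrorRecord` names are hypothetical, and the policy of forwarding only uncorrectable errors to the OS is an assumption chosen to keep the example short, not a description of EMCA2 or HPE's firmware.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ErrorRecord:
    source: str        # hypothetical component name, e.g. "dimm7"
    correctable: bool


@dataclass
class Firmware:
    os_handler: Callable[[ErrorRecord], None]
    log: List[ErrorRecord] = field(default_factory=list)

    def on_machine_check(self, record: ErrorRecord) -> str:
        # Firmware always gets the first look and keeps its own log.
        self.log.append(record)
        if record.correctable:
            # Assumed policy: the OS is not involved for correctable errors.
            return "handled in firmware"
        # Forward only what the OS has to act on.
        self.os_handler(record)
        return "forwarded to OS"
```

Because the firmware log is written before any forwarding decision, diagnosis data is available even for errors the operating system never sees.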

Reporting and handling errors: Error Analysis Engine

The Error Analysis Engine constantly analyzes all hardware for faults. Based on detected errors, the Analysis Engine can predict failures and initiate automatic recovery actions as well as notify system administrators and management software such as HPE OneView and HPE Insight Remote Support. This best-in-class predictive fault handling technology initiates self-repair without operator assistance, reducing human error and therefore increasing availability.

Because our goal is to determine the root cause of a failure on-system and take action to self-heal immediately, we embedded the Analysis Engine directly inside the system rather than relying on core dumps, memory dumps, and other log data for detailed analysis afterward. The first priority is to repair while the system is still up; the second priority is to repair it instantly during a reboot. Conventional x86 systems will detect an error and crash (reboot); the BIOS will then run a power-on self-test and likely miss what was really wrong with the system. With the Analysis Engine, HPE Superdome Flex detects the error and analyzes it when it happens. It then marks the failed component as faulty and possibly removes it electrically. Thus the faulty component is permanently known (and in many cases removed from operation) while the system is running or after a reboot. Lacking this capability, a typical x86 system, in response to the same fault, is likely to leave the component running in the system, where it will likely fail again, possibly causing data loss or leaving the system completely dead until it is physically repaired. Superdome Flex’s ability to self-heal means the faulty component is no longer in use, your system is back up and running, and you can plan downtime at your convenience to repair the system.
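The idea of remembering a faulty component and keeping it out of service can be sketched as follows. This is a minimal Python model under stated assumptions: the `AnalysisEngine` class here is a hypothetical stand-in, the threshold of three correctable errors is an invented value for illustration, and the actual engine's rules are not described in this post.

```python
from collections import Counter
from typing import Iterable, List, Set


class AnalysisEngine:
    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold      # hypothetical predictive threshold
        self.counts: Counter = Counter()
        self.faulty: Set[str] = set()   # survives across simulated reboots

    def report(self, component: str, uncorrectable: bool = False) -> bool:
        """Record one error; return True if the component is now marked faulty."""
        self.counts[component] += 1
        if uncorrectable or self.counts[component] >= self.threshold:
            self.faulty.add(component)
        return component in self.faulty

    def usable(self, inventory: Iterable[str]) -> List[str]:
        # At boot or at runtime, leave faulty components out of service.
        return [c for c in inventory if c not in self.faulty]
```

A system that simply reboots would call the equivalent of `usable` with no memory of past errors; keeping the `faulty` set is what lets the machine come back up without the component that caused the problem.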

As I mentioned, this is just a small sample of the advanced RAS features present in HPE Superdome Flex. We also offer the leading clustering, high-availability and disaster recovery software suite, HPE Serviceguard for Linux, and extensive HPE Pointnext services to protect against system and site outages. We have committed partnerships with other technology vendors, such as SAP, Microsoft, Intel, Red Hat, SUSE, and more, to help ensure your entire ecosystem works well together and delivers the availability your critical workloads need.

So why does this matter?

In my previous blog on Superdome Flex architecture, I talked about a world of ever-growing data and how, in supporting the business, IT teams need systems that respond effectively and promptly to their requests, regardless of the quantity of data or how fast it grows. But this is not all. In truth, companies face a complex dilemma—handling these unprecedented data flows while, at the same time, maintaining business continuity and the agility to respond quickly and efficiently to business change.

One certainty is that the business never stops. The customer, supplier, or facility is “always-on,” and a business built on real-time data suffers with any hiccup. Infrastructure reliability and end-to-end security are vital when operating in an environment where business continuity is expected. In any ecosystem, all the elements need to work in concert to deliver the expected result.

When it comes to IT ecosystems, everything starts at the component and infrastructure level. So if you want to achieve business continuity for your critical workloads, the reliability of your infrastructure is paramount. That’s why, with HPE Superdome Flex, you can rest assured that your environment is standing on solid ground, delivering must-have availability and reliability.

In my next blog, I will go over the performance benefits of HPE Superdome Flex and how they can help add agility and speed to your environment. In the meantime, learn more about our mission-critical x86 solutions.

Meet Servers: The Right Compute Blogger Diana Cortes, Marketing Manager, Mission Critical x86 Solutions, HPE.

Diana has spent the past 20 years working with the technologies that power the world’s most demanding environments and is interested in how solutions based on those technologies impact the business. A native of Colombia, Diana holds an MBA from Georgetown University and has held a variety of regional and global roles with HPE in the US, the UK and Sweden.
