High availability and disaster recovery with HPE Serviceguard Extension for SAP HANA

ComputeExperts · 10-19-2020

Let's get technical and learn about HPE Serviceguard for Linux and what makes this the ideal high availability and disaster recovery solution for any customer deploying SAP HANA as a mission-critical database.

In the past ten years, SAP HANA has become the de facto database behind most SAP applications. Customers see SAP HANA as a critical component in their SAP application landscape. That's why it is important to take every possible care to keep it highly available and, if it does go down, to bring it back up in the shortest possible time without any loss of data. Typically, automatic cluster management for the SAP HANA database and seamless management of a large number of database instances or containers become important. What's more, customers need to consider host-based system replication from a primary site to a secondary site.

High availability as a concept means deploying an application, a database or a component in a redundant way to make sure that even if a component is down, a standby can take over in a very short time – with no noticeable downtime. This is where HPE Serviceguard for Linux (SGLX) comes in.

HPE Serviceguard for Linux (SGLX)

HPE Serviceguard for Linux (SGLX) is clustering software that manages Linux clusters seamlessly and with minimal administrator intervention. It is an application high availability (HA) and disaster recovery (DR) clustering solution that increases uptime for mission-critical applications by protecting them from a multitude of infrastructure and application faults across physical or virtual environments over any distance. It reduces the impact of unplanned downtime with no compromise on data integrity and performance. It also helps achieve near-zero planned downtime for maintenance.

Now consider SGLX in the context of SAP HANA, with a focus on the Smart Quorum Server and the HPE Serviceguard for Linux extension for SAP HANA (SGeSAP).

Smart Quorum Server

The Serviceguard Quorum Server provides arbitration services for Serviceguard clusters when a cluster partition is discovered. The Serviceguard Smart Quorum is HANA-aware and makes sure that the primary HANA system continues to remain operational when it gets cut off from its secondary. The Smart Quorum values HANA primaries over secondaries. It is thus superior to common majority quorum mechanisms. Smart Quorum enables configurations of scale-out clusters that co-exist with one or more HANA standby nodes per site.
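To make the difference from a majority quorum concrete, here is a minimal sketch of the two arbitration rules. It is an illustration only and not Serviceguard code: the `QuorumRequest` type, its fields, and both function names are hypothetical, and the real Quorum Server logic is not described in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumRequest:
    """Quorum request from one surviving cluster partition (hypothetical model)."""
    site: str
    node_count: int
    runs_hana_primary: bool


def grant_majority(requests):
    """Plain majority rule: the largest partition wins, a tie yields no winner."""
    ranked = sorted(requests, key=lambda r: r.node_count, reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0].node_count == ranked[1].node_count:
        return None
    return ranked[0].site


def grant_workload_aware(requests):
    """Workload-aware rule: prefer the partition that runs the HANA primary."""
    primaries = [r for r in requests if r.runs_hana_primary]
    if len(primaries) == 1:
        return primaries[0].site
    # No primary, or more than one claimed primary: fall back to majority.
    return grant_majority(requests)


# Assumed scenario: the secondary site has an extra standby node.
requests = [
    QuorumRequest(site="PROD", node_count=2, runs_hana_primary=True),
    QuorumRequest(site="NEAR DR", node_count=3, runs_hana_primary=False),
]
print(grant_majority(requests))        # NEAR DR
print(grant_workload_aware(requests))  # PROD
```

In this assumed scenario, a pure node count would hand the cluster to the secondary site, while the workload-aware rule keeps the primary running.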

HPE Serviceguard for Linux extension for SAP HANA (SGeSAP)

HPE Serviceguard for Linux extension for SAP HANA extends HPE Serviceguard for Linux to protect SAP HANA databases from different kinds of failures. It provides fully automated disaster recovery as well as replication monitoring and management for SAP HANA. HPE Serviceguard works at the node, OS, database, and application levels.

Key Serviceguard capabilities for automatic HA/DR include:

Automated role reversal between primary and secondary
Automatic change in replication direction
Automatic start of replication on primary recovery
Automated failover only happens if the HANA secondary is in sync; HPE Safesync ensures this
Automated failover only happens if the HANA secondary has not fallen out of sync beyond customizable thresholds for time or amount of changed data; HPE backlog checks ensure this
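The last two items in the list above amount to a gate in front of every automated failover. The sketch below shows one way such a gate could look. All names, fields, and threshold values are hypothetical, and the assumption that the backlog thresholds act as an optional relaxation of the strict in-sync rule is ours, not something the product documentation states here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SecondaryState:
    in_sync: bool               # secondary reports a fully synchronized state
    seconds_out_of_sync: float  # time since the secondary was last in sync
    unreplicated_bytes: int     # changed data not yet present on the secondary


@dataclass
class BacklogThresholds:
    max_seconds_out_of_sync: float
    max_unreplicated_bytes: int


def failover_allowed(state: SecondaryState,
                     thresholds: Optional[BacklogThresholds] = None) -> bool:
    """Return True only if promoting the secondary is considered safe."""
    if state.in_sync:
        return True
    if thresholds is None:
        # Strict rule: an out-of-sync secondary is never promoted.
        return False
    # Relaxed rule: tolerate a bounded backlog in both time and data volume.
    return (state.seconds_out_of_sync <= thresholds.max_seconds_out_of_sync
            and state.unreplicated_bytes <= thresholds.max_unreplicated_bytes)


lagging = SecondaryState(in_sync=False, seconds_out_of_sync=45.0,
                         unreplicated_bytes=10_000_000)
print(failover_allowed(lagging))                                        # False
print(failover_allowed(lagging, BacklogThresholds(60.0, 50_000_000)))   # True
print(failover_allowed(lagging, BacklogThresholds(30.0, 50_000_000)))   # False
```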

HPE Serviceguard also comes with a wide range of supported features that cover the complete infrastructure stack at hardware, OS, database and application level:

Support for up to 32 nodes in a scale-out database
Support for HANA pmem configurations
Supported on both SLES and RHEL operating systems
Support for both scale-up and scale-out SAP HANA databases
Support for multi-tenant database, including monitoring of replication status of all tenants and automated instance takeover for any instance service failure
Support for HANA replication modes i.e. synchronous, syncmem or asynchronous
Support for dual purpose storage configuration
Support for clustered scale-up and scale-out HANA XSA setups
Support for additional multi-target and multi-tier replication instances outside of the cluster
Minimized downtime after an outage thanks to the unique Serviceguard Quarantine function, which often adds only seconds to the HANA-internal takeover time while ensuring reliable failure detection, split-brain prevention and I/O fencing
Smart recovery without impacting data integrity
Co-existence of HANA clustering and NetWeaver clustering within the same cluster
HA/DR for SAP NetWeaver application servers

Failure scenarios

Here is an explanation of common failure scenarios and how Serviceguard with SGeSAP responds.

Database failure – In this case there is a primary SAP HANA database running at the PROD site and it is being replicated to another SAP HANA database at the NEAR DR site. Once HPE Serviceguard notices a primary database service failure at the PROD site, it brings up the secondary database from the NEAR DR site, and automatically promotes it to be the new primary database. This helps to keep the application running without any downtime.

Once the database at the PROD site comes up, replication is started from the NEAR DR site to the PROD site. Once both databases are in sync again, the failback is done automatically: the original primary database at the PROD site is promoted to primary again and the database at the NEAR DR site is demoted to secondary. The direction of replication is reversed once more, so that it again runs from the original primary to the secondary.
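The failover and failback sequence for a database failure can be sketched as a small state model. This is a simplified illustration under our own assumptions, not how SGeSAP is implemented: the `ReplicationPair` class and its method names are hypothetical, and real takeovers involve many more checks.

```python
class ReplicationPair:
    """Toy model of a primary/secondary HANA pair (hypothetical names)."""

    def __init__(self, preferred_primary, other_site):
        self.preferred_primary = preferred_primary
        self.primary = preferred_primary
        self.secondary = other_site
        self.secondary_running = True
        self.in_sync = True

    def replication_direction(self):
        return f"{self.primary} -> {self.secondary}"

    def primary_database_failed(self):
        """Promote the secondary; the failed site becomes the (down) secondary."""
        if not (self.secondary_running and self.in_sync):
            raise RuntimeError("secondary is not eligible for takeover")
        self.primary, self.secondary = self.secondary, self.primary
        self.secondary_running = False
        self.in_sync = False

    def failed_site_recovered(self):
        """The failed database is back and replicates from the new primary."""
        self.secondary_running = True

    def replication_caught_up(self):
        if self.secondary_running:
            self.in_sync = True

    def fail_back_if_possible(self):
        """Restore the original roles once both sides are in sync again."""
        if (self.primary != self.preferred_primary
                and self.secondary_running and self.in_sync):
            self.primary, self.secondary = self.secondary, self.primary
            return True
        return False


pair = ReplicationPair("PROD", "NEAR DR")
pair.primary_database_failed()
print(pair.replication_direction())  # NEAR DR -> PROD
pair.failed_site_recovered()
pair.replication_caught_up()
pair.fail_back_if_possible()
print(pair.replication_direction())  # PROD -> NEAR DR
```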

Primary node failure – In this case there is a primary SAP HANA database running at the PROD site and it is being replicated to another SAP HANA database at the NEAR DR site. If the primary node fails, Serviceguard gracefully shuts down the primary database and promotes the secondary database running at the NEAR DR site. Once the original primary at the PROD site comes back online, it can rejoin the cluster, and replication can start again from the new primary at the NEAR DR site to the database at the PROD site. Once the two are in sync, Serviceguard can automatically reverse their roles to restore the status that existed before the failover. The difference between a primary node failure and a database failure is that in a database failure the OS is still running, whereas in a node failure the OS itself is down.

Double failure handling – As mentioned earlier, HPE Serviceguard monitors SAP HANA System Replication. If it notices that System Replication has failed, it marks the secondary node as ineligible for promotion to primary. If another failure then occurs at the primary node, for example the database going down, Serviceguard does not fail over to the secondary node, since it is ineligible for promotion. Instead, Serviceguard tries to restart the database on the primary node itself.

Once the primary and secondary are up again, Serviceguard restarts SAP HANA System Replication. Only after the primary and secondary are in sync again is the secondary marked as eligible for promotion to primary once more.
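The double failure rule boils down to a single eligibility flag that replication monitoring sets and clears. A minimal sketch, with hypothetical class and method names that do not correspond to any Serviceguard interface:

```python
from enum import Enum


class Action(Enum):
    FAIL_OVER = "fail over to secondary"
    RESTART_LOCALLY = "restart database on primary node"


class PromotionGuard:
    """Tracks whether the secondary may be promoted (hypothetical model)."""

    def __init__(self):
        self.secondary_eligible = True

    def on_replication_failed(self):
        # The secondary may now hold outdated data, so block promotion.
        self.secondary_eligible = False

    def on_replication_in_sync(self):
        # Only a fully re-synchronized secondary becomes eligible again.
        self.secondary_eligible = True

    def on_primary_database_down(self):
        if self.secondary_eligible:
            return Action.FAIL_OVER
        return Action.RESTART_LOCALLY


guard = PromotionGuard()
guard.on_replication_failed()
print(guard.on_primary_database_down().value)  # restart database on primary node
guard.on_replication_in_sync()
print(guard.on_primary_database_down().value)  # fail over to secondary
```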

Dual-purpose setup failure – SAP allows the secondary in a replication environment to be used for non-production databases. In this case, if the primary database or node fails, Serviceguard shuts down the non-production database at the NEAR DR site and promotes the secondary database to primary. Once the original primary database comes up at the PROD site, it becomes a secondary in the Serviceguard cluster. Serviceguard then reinitiates replication from the NEAR DR site to the PROD site. Once the primary and secondary are in sync again, Serviceguard reverses their roles and the direction of replication once more.

Split-brain failures – Split brain is a typical situation in clusters in which the cluster has been fragmented and more than one surviving segment requests quorum. HPE Serviceguard deploys a patented Smart Quorum technology to handle such situations. Serviceguard grants the quorum to the requestor that has the preferred workload running and denies the quorum to the secondary or DR site.

Multi-tenant database container (MDC) tenant failure – SAP HANA can host multiple tenant databases within a SID, and HPE Serviceguard monitors the replication status of all tenants separately. If there is a replication failure on a tenant in a multi-tenant database, Serviceguard automatically triggers a takeover of all tenants from the PROD site to the NEAR DR site. The NEAR DR site gets promoted to primary and the PROD site is demoted to secondary. The replication direction is also reversed, so that it runs from NEAR DR to PROD.
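The key point here is that tenants are checked individually, but the takeover decision applies to the whole system. The sketch below illustrates only that shape. The tenant names, the "OK" and "FAILED" labels, and the function names are made up for the example and are not HANA or Serviceguard status values.

```python
def failed_tenants(tenant_health):
    """Return the names of all tenants whose status is not 'OK', sorted."""
    return sorted(name for name, status in tenant_health.items()
                  if status != "OK")


def plan_action(tenant_health):
    """Per-tenant check, whole-system decision: one failure moves all tenants."""
    failed = failed_tenants(tenant_health)
    if failed:
        return {"action": "takeover_all_tenants", "failed": failed}
    return {"action": "none", "failed": []}


# Hypothetical status snapshot for one SID with two tenants.
health = {"SYSTEMDB": "OK", "TENANT_A": "OK", "TENANT_B": "FAILED"}
print(plan_action(health))
```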

A quick summary of some other useful features

Serviceguard Quarantine – Serviceguard can be configured to cut off a primary production HANA instance from its networks immediately after a failure has been detected, such as an abort of a HANA indexserver process. With Quarantine, a failover can trigger remote continuation of production faster than a local restart would be possible. Production use can continue in parallel while the failing instance is still being halted and cleaned up by Serviceguard on the original production hardware. Quarantine makes failover faster and more reliable. The failing instance is not left frozen in an undefined state; it either shuts down regularly or is cleaned out by Serviceguard. Serviceguard then automatically converts its host(s) to serve as a working secondary HANA instance, which creates a new failover target as soon as possible. Server reboots are not necessary in case of software failures.

Database connection suspension – Serviceguard has a feature to suspend active database connections from SAP NetWeaver application servers. Using this feature, Serviceguard requests SAP NetWeaver application servers to temporarily suspend their database connections for maintenance purposes. Once the maintenance activity has been completed, Serviceguard sends resume instructions to the NetWeaver application servers. This helps prevent ongoing work from being aborted by planned downtime.

Serviceguard Safesync blocks – To explain this, let us assume that there is first a database failure at the secondary site or a replication network outage between primary and secondary. After a short delay, even a primary instance in HANA sync replication mode will continue to operate. It commits additional data changes to the application layer that are not reflected in the secondary data. Serviceguard then makes sure that neither an automated failover, nor a cluster or server reboot, nor an administrator's error causes the outdated secondary database contents to serve as production system data. Without Safesync, a failover can easily lead to silent data loss.

Advanced monitoring – The solution brings together proven, fast, generic Serviceguard hardware monitoring components to track the health of servers, disks and link-level networking components. It adds operating system process monitoring and regular monitoring of SAP-reported internal states, and it catches SAP-triggered HANA indexserver crash event hooks to reliably detect many common software issues. A new function uses SQL probing to validate responsiveness and to detect any indexserver process that has stopped responding.
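SQL probing comes down to running a cheap statement under a deadline and treating a missed deadline as a hang. Here is a generic sketch of that idea. It is not the Serviceguard monitor: `probe` and the stand-in query functions are hypothetical, and a real probe would run a lightweight statement through a database client, which is assumed rather than shown.

```python
import concurrent.futures
import time


def probe(run_query, timeout_seconds):
    """Run a probe query with a deadline and classify the outcome."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(run_query)
    try:
        future.result(timeout=timeout_seconds)
        return "responsive"
    except concurrent.futures.TimeoutError:
        # The query neither returned nor failed in time: treat as a hang.
        return "not responding"
    except Exception:
        # The query failed outright, e.g. the connection was refused.
        return "failed"
    finally:
        # Do not block the monitor on a query that is still hanging.
        executor.shutdown(wait=False)


# Stand-ins for a real database call (assumptions for the example).
def healthy_query():
    return 1


def hanging_query():
    time.sleep(0.5)


def broken_query():
    raise ConnectionError("connection refused")


print(probe(healthy_query, timeout_seconds=0.2))  # responsive
print(probe(hanging_query, timeout_seconds=0.1))  # not responding
print(probe(broken_query, timeout_seconds=0.2))   # failed
```

The deadline is what separates a hang from a crash: a crashed process is caught by process monitoring, while a process that is alive but stuck only shows up when a query fails to return in time.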

In conclusion

HPE Serviceguard for Linux can automatically protect and recover an SAP HANA database from all kinds of potential failures that can occur in a large environment, without compromising data integrity. This is an ideal high availability and disaster recovery solution for any customer deploying SAP HANA as a mission-critical database.

To learn more about HPE Serviceguard for Linux, watch the demo and check out the HPE Reference Architectures.

Meet the bloggers

Markus Ertl is a master product architect, working onsite at SAP headquarters in Germany. For close to a quarter of a century, he has driven the development of value-add software solutions for HPE in various SAP contexts, and he designs the SAP-related feature set for HPE Serviceguard.

Sanjay has two decades of experience in the IT industry. Currently, he is a Solutions Architect with the HPE Solution Engineering Team – SAP HANA, which is focused on creating solutions and reference architectures for enterprise use cases around SAP and/or SAP HANA technologies.

Compute Experts
Hewlett Packard Enterprise

twitter.com/HPE_Servers
linkedin.com/showcase/hpe-servers-and-systems/
hpe.com/servers
