Servers & Systems: The Right Compute
ComputeExperts

High availability and disaster recovery with HPE Serviceguard Extension for SAP HANA

Let's get technical and learn about HPE Serviceguard for Linux and what makes this the ideal high availability and disaster recovery solution for any customer deploying SAP HANA as a mission-critical database. 


In the past ten years, SAP HANA has become the de facto database behind most SAP applications. Customers see SAP HANA as a critical component of their SAP application landscape. That's why it is important to take every possible care to keep it highly available and, if it does go down, to bring it back up in the shortest possible time without any loss of data. Automatic cluster management for the SAP HANA database and seamless management of a large number of database instances or containers typically become important. What's more, customers need to consider host-based system replication from a primary site to a secondary site.

High availability as a concept means deploying an application, a database, or a component redundantly so that even if a component goes down, a standby can take over in a very short time with no noticeable downtime. This is where HPE Serviceguard for Linux (SGLX) comes in.

HPE Serviceguard for Linux (SGLX)

HPE Serviceguard for Linux (SGLX) is clustering software that manages Linux clusters seamlessly and with minimal administrator intervention. It is an application high availability (HA) and disaster recovery (DR) clustering solution that increases uptime for mission-critical applications by protecting them from a multitude of infrastructure and application faults across physical or virtual environments over any distance. It reduces the impact of unplanned downtime with no compromise on data integrity and performance. It also helps achieve near-zero planned downtime for maintenance.

Now consider SGLX in the context of SAP HANA, with a focus on Smart Quorum Server and the HPE Serviceguard for Linux extension for SAP HANA (SGeSAP).

Smart Quorum Server

The Serviceguard Quorum Server provides arbitration services for Serviceguard clusters when a cluster partition is discovered. The Serviceguard Smart Quorum is HANA-aware and makes sure that the primary HANA system continues to remain operational when it gets cut off from its secondary. The Smart Quorum values HANA primaries over secondaries. It is thus superior to common majority quorum mechanisms. Smart Quorum enables configurations of scale-out clusters that co-exist with one or more HANA standby nodes per site.
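The difference between majority arbitration and a workload-aware quorum can be illustrated with a small sketch. This is not the Serviceguard implementation: the `Partition` type, its fields, and both functions are hypothetical and exist only to show why preferring the HANA primary can give a different answer than counting nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """One surviving cluster fragment asking for quorum (hypothetical model)."""
    name: str
    node_count: int
    hosts_hana_primary: bool


def majority_winner(partitions):
    """Plain majority arbitration: the fragment with the most nodes wins."""
    return max(partitions, key=lambda p: p.node_count)


def workload_aware_winner(partitions):
    """Prefer the fragment running the HANA primary; fall back to node count."""
    return max(partitions, key=lambda p: (p.hosts_hana_primary, p.node_count))


prod = Partition("PROD", node_count=2, hosts_hana_primary=True)
dr = Partition("NEAR DR", node_count=3, hosts_hana_primary=False)

print(majority_winner([prod, dr]).name)        # NEAR DR
print(workload_aware_winner([prod, dr]).name)  # PROD
```

In this toy example the larger fragment does not host the primary, so a pure node count would hand quorum to the site that cannot serve production on its own.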

HPE Serviceguard for Linux extension for SAP HANA (SGeSAP)

HPE Serviceguard for Linux extension for SAP HANA extends HPE Serviceguard for Linux to protect SAP HANA databases from different kinds of failures. It provides fully automated disaster recovery as well as replication monitoring and management for SAP HANA. HPE Serviceguard works at the node, OS, database, and application levels.

Key Serviceguard functions for automatic HA/DR include:

  • Automated role reversal between primary and secondary
  • Automatic change in replication direction
  • Automatic start of replication on primary recovery
  • Automated failover only happens if the HANA secondary is in sync; HPE Safesync ensures this
  • Automated failover only happens if the HANA secondary has not fallen out of sync beyond customizable thresholds for time or amount of changed data; HPE backlog checks ensure this
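The last two conditions in the list above amount to a gate in front of any automated failover. The following is a minimal sketch of such a gate, not Serviceguard code: the `SecondaryState` fields, the threshold parameter names, and the way the two checks are combined are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SecondaryState:
    """Snapshot of the secondary as seen by a monitor (hypothetical fields)."""
    in_sync: bool
    seconds_behind: float  # replication delay
    backlog_bytes: int     # changed data not yet shipped to the secondary


def failover_allowed(state, max_seconds_behind=0.0, max_backlog_bytes=0):
    """Return True if an automated failover to this secondary is acceptable.

    A fully synchronized secondary always qualifies. Otherwise both the time
    backlog and the data backlog must stay within the configured thresholds.
    """
    if state.in_sync:
        return True
    return (state.seconds_behind <= max_seconds_behind
            and state.backlog_bytes <= max_backlog_bytes)


lagging = SecondaryState(in_sync=False, seconds_behind=45.0, backlog_bytes=10_000_000)

print(failover_allowed(SecondaryState(True, 0.0, 0)))                                  # True
print(failover_allowed(lagging, max_seconds_behind=60, max_backlog_bytes=50_000_000))  # True
print(failover_allowed(lagging, max_seconds_behind=30, max_backlog_bytes=50_000_000))  # False
```

With the default thresholds of zero, a secondary that reports any time or data backlog is rejected, which mirrors the strict in-sync requirement.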

HPE Serviceguard also comes with a wide range of supported features that cover the complete infrastructure stack at hardware, OS, database and application level.

  • Support for up to 32 nodes in a scale-out database
  • Support for HANA pmem configurations
  • Supported on both SLES and RHEL operating systems
  • Support for both scale-up and scale-out SAP HANA databases
  • Support for multi-tenant database, including monitoring of replication status of all tenants and automated instance takeover for any instance service failure
  • Support for HANA replication modes, i.e., synchronous, syncmem, or asynchronous
  • Support for dual purpose storage configuration
  • Support of clustered scale-up and scale-out HANA XSA setups
  • Support of additional multi-target and multi-tier replication instances outside of the cluster
  • Minimized downtime after an outage due to the unique Serviceguard Quarantine function that often only adds seconds to the HANA-internal takeover time to ensure reliable failure detection, split-brain prevention and I/O-fencing
  • Smart recovery without impacting data integrity
  • Co-existence of HANA clustering and NetWeaver clustering within the same cluster
  • HA/DR for SAP NetWeaver application servers

Failure scenarios

Here is an explanation of common failure scenarios and how Serviceguard with SGeSAP responds.

Database failure – In this case there is a primary SAP HANA database running at the PROD site and it is being replicated to another SAP HANA database at the NEAR DR site. Once HPE Serviceguard notices a primary database service failure at the PROD site, it brings up the secondary database from the NEAR DR site, and automatically promotes it to be the new primary database. This helps to keep the application running without any downtime.


Once the database at the PROD site comes up, replication is started from the NEAR DR site to the PROD site, and once both databases are in sync again, the failback is done automatically. The original primary database at the PROD site is promoted to become primary again and the database at the NEAR DR site is demoted to secondary. The direction of replication is reversed once more, so that it again runs from the original primary to the secondary.
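The takeover and failback sequence for a database failure can be modeled as a small state object. This is a sketch under simplifying assumptions: the `ReplicationPair` class and its method names are hypothetical, and real role changes involve far more than swapping two labels.

```python
from dataclasses import dataclass


@dataclass
class ReplicationPair:
    """Two sites in system replication (hypothetical model)."""
    primary: str = "PROD"
    secondary: str = "NEAR DR"
    in_sync: bool = True

    def takeover(self):
        """Promote the secondary and reverse the replication direction."""
        self.primary, self.secondary = self.secondary, self.primary
        self.in_sync = False  # the recovered site must catch up first

    def resync_complete(self):
        """Record that the new secondary has caught up with the new primary."""
        self.in_sync = True

    def failback(self):
        """Swap roles back, but only when both databases are in sync."""
        if not self.in_sync:
            raise RuntimeError("failback refused: sites are not in sync")
        self.primary, self.secondary = self.secondary, self.primary


pair = ReplicationPair()
pair.takeover()
print(pair.primary, "->", pair.secondary)  # NEAR DR -> PROD
pair.resync_complete()
pair.failback()
print(pair.primary, "->", pair.secondary)  # PROD -> NEAR DR
```

The guard in `failback` reflects the point made above: roles are only restored after the two databases are back in sync.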

Primary node failure – In this case, a primary SAP HANA database runs at the PROD site and is replicated to another SAP HANA database at the NEAR DR site. If the primary node fails, Serviceguard gracefully shuts down the primary database and promotes the secondary database running at the NEAR DR site. Once the original primary at the PROD site comes back online, it can join the cluster again, and replication can restart from the new primary at the NEAR DR site to the database at the PROD site. Once the two are in sync, Serviceguard can automatically reverse their roles to restore the status that existed before the failover. The difference between a primary node failure and a database failure is that in a database failure the OS is still running, whereas in a node failure the OS itself is down.


Double failure handling – As mentioned earlier, HPE Serviceguard monitors SAP HANA System Replication, and if it notices that System Replication has failed, it marks the secondary node as ineligible for promotion to primary. If another failure then occurs at the primary node, for example the database going down, Serviceguard does not fail over to the secondary node, since it is ineligible for promotion. Instead, Serviceguard tries to restart the database on the primary node itself.

Once the primary and secondary are up again, Serviceguard restarts SAP HANA System Replication, and only after the primary and secondary are in sync again is the secondary marked as eligible for promotion to primary once more.
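The double failure logic boils down to an eligibility flag that decides between two recovery actions. Here is a minimal sketch of that decision; the `PromotionGuard` class, its event method names, and the `Action` values are hypothetical and are not taken from the product.

```python
from enum import Enum


class Action(Enum):
    FAILOVER = "fail over to secondary"
    LOCAL_RESTART = "restart database on primary node"


class PromotionGuard:
    """Tracks whether the secondary may be promoted (hypothetical model)."""

    def __init__(self):
        self.secondary_eligible = True

    def on_replication_failed(self):
        """Replication broke, so the secondary may hold outdated data."""
        self.secondary_eligible = False

    def on_back_in_sync(self):
        """Replication was restarted and both sides are in sync again."""
        self.secondary_eligible = True

    def on_primary_database_down(self):
        """Pick a recovery action for a primary database failure."""
        if self.secondary_eligible:
            return Action.FAILOVER
        return Action.LOCAL_RESTART


guard = PromotionGuard()
guard.on_replication_failed()
print(guard.on_primary_database_down().value)  # restart database on primary node
guard.on_back_in_sync()
print(guard.on_primary_database_down().value)  # fail over to secondary
```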


Dual purpose setup failure – SAP allows the secondary in a replication environment to be used for non-production databases. In this case, if the primary database or node fails, Serviceguard shuts down the non-production database at the NEAR DR site and promotes the secondary database to primary. Once the original primary database comes up at the PROD site, it becomes a secondary in the Serviceguard cluster. Serviceguard then reinitiates replication from the NEAR DR site to the PROD site. Once the primary and secondary are in sync again, Serviceguard again reverses their roles and the direction of replication.


Split-brain failures – Split brain is a typical situation in clusters in which the cluster has been fragmented and more than one surviving segment requests quorum. HPE Serviceguard deploys a patented Smart Quorum technology to handle such situations. Serviceguard grants quorum to the requestor that has the preferred workload running and denies it to the secondary or DR site.


Multi-tenant database container (MDC) tenant failure – SAP HANA can host multiple tenant databases within a SID, and HPE Serviceguard monitors the replication status of each tenant separately. If replication fails for a tenant in a multi-tenant database, Serviceguard automatically triggers a takeover of all tenants from the PROD site to the NEAR DR site. The NEAR DR site is promoted to primary and the PROD site is demoted to secondary. The replication direction is also reversed, so that it runs from NEAR DR to PROD.


A quick summary of some other useful features

Serviceguard Quarantine – Serviceguard can be configured to cut off a primary production HANA instance from its networks immediately after a failure has been detected, such as an abort of a HANA indexserver process. Using Quarantine, a failover can trigger remote continuation of production faster than a local restart would be possible. Production use can continue in parallel while the failing instance is still being halted and cleaned up by Serviceguard on the original production hardware. Failover is faster and more reliable due to Quarantine. The failing instance does not get frozen in an undefined state; it either shuts down regularly or gets cleaned out by Serviceguard. Serviceguard then automatically converts its host(s) to serve as a working secondary HANA instance, which creates a new failover target as soon as possible. Server reboots are not necessary in case of software failures.

Database connection suspension – Serviceguard has a feature to suspend active database connections from SAP NetWeaver application servers. Using this feature, Serviceguard requests SAP NetWeaver application servers to temporarily suspend their database connections for maintenance purposes. Once the maintenance activity has been completed, Serviceguard sends resume instructions to the NetWeaver application servers. This helps to prevent ongoing work from being aborted during planned downtime.
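The suspend, maintain, resume pattern maps naturally onto a context manager, which guarantees that the resume step runs even if maintenance fails. The sketch below only illustrates that ordering: the `suspended_connections` function is hypothetical, the server names are made up, and a plain list of log entries stands in for whatever requests are actually sent to the application servers.

```python
from contextlib import contextmanager


@contextmanager
def suspended_connections(app_servers, log):
    """Suspend database connections for maintenance, then always resume."""
    for server in app_servers:
        log.append(f"suspend {server}")
    try:
        yield
    finally:
        # Resume even if the maintenance step raised an exception.
        for server in app_servers:
            log.append(f"resume {server}")


events = []
with suspended_connections(["app1", "app2"], events):
    events.append("maintenance")

print(events)
# ['suspend app1', 'suspend app2', 'maintenance', 'resume app1', 'resume app2']
```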

Serviceguard Safesync blocks – To explain this, assume that there is first a database failure at the secondary site or a replication network outage between primary and secondary. After a short delay, even a primary instance in HANA sync replication mode will continue to operate. It commits additional data changes to the application layer that are not reflected in the secondary data. Serviceguard subsequently makes sure that neither an automated failover, nor a cluster or server reboot, nor an administrator's mistake causes such outdated secondary database contents to serve as production system data. Without Safesync, a failover can easily lead to silent data loss.

Advanced monitoring – The solution brings together proven, fast, generic Serviceguard hardware monitoring components to track the health of servers, disks, and link-level networking components. It adds operating system process monitoring and regular monitoring of SAP-reported internal states, and it catches SAP-triggered HANA indexserver crash event hooks to reliably detect many common software issues as well. A new function uses SQL probing to validate responsiveness and detect hangs of any indexserver process.
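The idea behind SQL probing is that a process can be alive yet unresponsive, so a monitor has to run a query under a deadline and not just check the process table. The sketch below shows one generic way to do that with Python's standard library. It is not how Serviceguard implements the probe: `probe` and `run_query` are hypothetical names, and the callable stands in for a trivial SQL statement sent to the database.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as ProbeTimeout


def probe(run_query, timeout_seconds):
    """Classify a database as 'responsive', 'hung', or 'failed'.

    run_query is any zero-argument callable that executes a trivial SQL
    statement and returns when the database answers.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(run_query)
    try:
        future.result(timeout=timeout_seconds)
        return "responsive"
    except ProbeTimeout:
        return "hung"    # no answer within the deadline
    except Exception:
        return "failed"  # the query itself raised an error
    finally:
        # Do not block the monitor on a stuck worker thread.
        executor.shutdown(wait=False)


print(probe(lambda: None, timeout_seconds=1.0))              # responsive
print(probe(lambda: time.sleep(0.5), timeout_seconds=0.05))  # hung
```

Distinguishing "hung" from "failed" matters because a hang produces no error at all; only the missed deadline reveals it.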

In conclusion

HPE Serviceguard for Linux can automatically protect and recover an SAP HANA database from all kinds of potential failures that can occur in a large environment, without compromising data integrity. This is an ideal high availability and disaster recovery solution for any customer deploying SAP HANA as a mission-critical database.

To learn more about HPE Serviceguard for Linux, watch the demo and also check out the HPE Reference Architectures.


Meet the bloggers

Markus Ertl is a master product architect, working onsite at SAP headquarters in Germany. For close to a quarter of a century, he has driven the development of value-add software solutions for HPE in various SAP contexts, and he designs the SAP-related feature set for HPE Serviceguard.

 

Sanjay has two decades of experience in the IT industry. Currently, he is a Solutions Architect with the HPE Solution Engineering Team – SAP HANA, which is focused on creating solutions and reference architectures for enterprise use cases around SAP and/or SAP HANA technologies.


Compute Experts
Hewlett Packard Enterprise


 

 
