SAP HA SUSE AWS troubleshooting

This article describes detailed steps for troubleshooting HA-related issues for SAP applications and databases.

In a highly available setup, the application or database is spread across multiple servers that together form a cluster.

In our environment we have 2-node clusters. The two nodes are in different AZs, which ensures that if one AZ has an issue the application is moved over to the other AZ.

The SUSE cluster setup is discussed below.

The cluster has three kinds of resources:

1- Stonith: This resource can shut down the other node when the cluster detects a split-brain situation. Shutting down the other node ensures that the data is served from only one server, so that data corruption does not happen.

A split-brain condition arises in a 2-node cluster when the two nodes of the high-availability pair can no longer communicate with each other while both are still running. Each node may then assume that the other has failed and try to take over the resources.

2- IP resource: This resource is bound to the app resource and acts as the gateway for reaching the app. It is the virtual IP and should be outside the VPC's network (CIDR) range.

3- App resource: This resource monitors the app on the server where it is online. If it identifies an issue that could render the application unusable, it triggers a failover: it first tries to shut the app down cleanly and, if that fails, it contacts the stonith resource to shut down the node. Once the shutdown is confirmed, it brings the app online on the other node.

This is what a healthy cluster looks like:

As shown below, the IP resource should be running on the same instance that runs the DB in master mode.

suse01:~ # crm status
Stack: corosync
Current DC: suse01 (version 1.1.16-6.5.1-77ea74d) - partition with quorum
Last updated: Tue Oct 23 13:20:46 2018
Last change: Tue Oct 23 13:19:52 2018 by root via crm_attribute on suse01

2 nodes configured
6 resources configured

Online: [ suse01 suse02 ]

Full list of resources:

res_AWS_STONITH (stonith:external/ec2): Started suse01
res_AWS_IP (ocf::suse:aws-vpc-move-ip): Started suse01    <-- this is the IP resource, running on suse01
Clone Set: cln_SAPHanaTopology_HA1_HDB10 [rsc_SAPHanaTopology_HA1_HDB10]
    Started: [ suse01 suse02 ]
Master/Slave Set: msl_SAPHana_HA1_HDB10 [rsc_SAPHana_HA1_HDB10]
    Masters: [ suse01 ]    <-- the master resource is running on suse01
    Slaves: [ suse02 ]
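The co-location rule above (IP resource on the same node as the master) can be checked mechanically. The following is a minimal sketch, not part of the original setup: the function name vip_matches_master is hypothetical, the default resource name res_AWS_IP is taken from the sample output, and the parsing assumes the Pacemaker 1.1 text layout shown above (newer releases format the status differently).

```shell
#!/bin/bash
# Sketch: read `crm status` text on stdin and compare the node running the
# IP resource with the node listed under "Masters:".
# Assumption: output layout as in the sample above (Pacemaker 1.1).

vip_matches_master() {
    local ip_res="${1:-res_AWS_IP}"   # resource name from the sample; adjust to your cluster
    local status ip_node master_node
    status="$(cat)"

    # Node name that follows "Started" on the IP resource line
    ip_node="$(printf '%s\n' "$status" |
        awk -v r="$ip_res" '$1 == r {
            for (i = 1; i < NF; i++) if ($i == "Started") { print $(i + 1); exit }
        }')"

    # First node inside "Masters: [ <node> ]"
    master_node="$(printf '%s\n' "$status" |
        awk '$1 == "Masters:" { print $3; exit }')"

    if [ -z "$ip_node" ] || [ -z "$master_node" ]; then
        echo "UNKNOWN: could not find $ip_res or a master in the status output"
        return 2
    fi
    if [ "$ip_node" = "$master_node" ]; then
        echo "OK: $ip_res and master are both on $master_node"
        return 0
    fi
    echo "MISMATCH: $ip_res on $ip_node, master on $master_node"
    return 1
}
```

On a cluster node it would be used as `crm status | vip_matches_master`, optionally passing a different IP resource name as the first argument.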

The cluster is configured in semi-automatic mode for the DB. When an event occurs, the cluster identifies the fault and triggers a failover/takeover.

The nodes of the HAE cluster monitor each other. They shut down unresponsive or misbehaving nodes prior to any failover actions in order to prevent data corruption. Setting the AWS stonith-action to power-off (stonith-action=off) permanently shuts down the defective cluster node. This expedites a takeover on AWS.

With the default setting, reboot, the STONITH agent waits until the reboot has completed successfully. This delays the reconfiguration of the SAP HANA database. Re-integrating a faulty cluster node into the cluster has to be performed manually, since it takes an investigation to determine why the cluster node did not operate as expected.
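As an illustration of the two stonith-action settings discussed above, the sketch below wraps the crmsh property command in a small helper that only accepts off or reboot. The helper name set_stonith_action is hypothetical; the underlying command is the standard `crm configure property` call and has to be run as root on a cluster node.

```shell
#!/bin/bash
# Sketch: set the cluster-wide stonith-action, accepting only the two values
# discussed above. The helper name is an assumption made for this example.

set_stonith_action() {
    case "$1" in
        off|reboot)
            # off    = fenced node stays powered off (faster takeover)
            # reboot = fenced node is restarted (default)
            crm configure property stonith-action="$1"
            ;;
        *)
            echo "usage: set_stonith_action off|reboot" >&2
            return 64
            ;;
    esac
}
```

For example, `set_stonith_action off` selects the power-off behaviour; the resulting value can be reviewed with `crm configure show`.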

The second (faulty) cluster node can also be configured to restart automatically. This, however, bears the risk that the remaining node is harmed by an incorrectly acting second (faulty) node. The second (faulty) node is reconfigured through the following steps:

1. Restart the node through the AWS console.

2. Investigate the node after the reboot and fix any defect found.

Skip steps 3, 4, and 6 if the cluster is an ASCS/ERS cluster.

3. Boot SAP HANA manually. Check the instance health. Fix a potential defect. Shut down SAP HANA.

4. Configure SAP HANA to be a secondary node to the new master node.

In this example, node 002 is now hosting the primary DB.

# hdbnsutil -sr_register --remoteHost=002 --remoteInstance=00 --replicationMode=syncmem --name=AZ1

5. Restart the HAE cluster with the command "systemctl start pacemaker" as superuser. This process can take several minutes.

6. Start the failed services by cleaning up the failed resources.

Log in to 001 as root and execute the following command:

# crm resource cleanup rsc_SAPHana_UP1_HDB00 001

This command starts HDB on 001 after cleaning up the failed resource on the AZ1 site.

7. Verify that all cluster services operate correctly. A takeover is now completed. The roles of the two cluster nodes have been flipped. The SAP HANA database is now protected against future failure events.
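Steps 4 to 6 above use values that differ per system (host names, instance number, site name, resource name). The sketch below only prints the three commands with those values filled in, so they can be reviewed before being run by hand. It deliberately does not execute anything, because hdbnsutil is run as the <sid>adm user while the other two commands are run as root. The function name and argument order are assumptions made for this example.

```shell
#!/bin/bash
# Sketch: print the commands for steps 4 to 6 with site-specific values.
# Nothing is executed; all five arguments are placeholders for your system.

print_reintegration_commands() {
    if [ "$#" -ne 5 ]; then
        echo "usage: print_reintegration_commands PRIMARY_HOST INSTANCE_NO SITE_NAME RESOURCE THIS_NODE" >&2
        return 64
    fi
    local primary_host="$1" instance_no="$2" site_name="$3" resource="$4" this_node="$5"

    # Step 4 (as <sid>adm, with SAP HANA stopped on this node)
    printf 'hdbnsutil -sr_register --remoteHost=%s --remoteInstance=%s --replicationMode=syncmem --name=%s\n' \
        "$primary_host" "$instance_no" "$site_name"
    # Step 5 (as root)
    printf 'systemctl start pacemaker\n'
    # Step 6 (as root)
    printf 'crm resource cleanup %s %s\n' "$resource" "$this_node"
}
```

With the values from the example above, the call would be `print_reintegration_commands 002 00 AZ1 rsc_SAPHana_UP1_HDB00 001`.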

The cluster is configured to work automatically without any intervention. However, the stonith and IP cluster resources may fail to start. When that happens, validate that the correct IAM role and policies are attached to the instances.

Also check for any recent changes that might have triggered this.
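One way to validate the role and policies from a cluster node is to try the relevant AWS CLI calls directly. The sketch below is an assumption-laden example, not the agents' own logic: the function name check_cluster_iam is hypothetical, the instance ID and route table ID are placeholders, and it assumes the AWS CLI on the node uses the same credentials as the cluster agents (add `--profile` to each call if your agents are configured with a named profile). The `--dry-run` flag means the instance is never actually stopped; a permitted call reports DryRunOperation.

```shell
#!/bin/bash
# Sketch: probe the permissions that the STONITH and IP resources rely on.
# Arguments: the peer node's EC2 instance ID and the VPC route table ID
# (both placeholders; take them from your own environment).

check_cluster_iam() {
    local instance_id="$1" route_table_id="$2" out rc=0

    # Show which identity (role) the AWS CLI is using on this node
    aws sts get-caller-identity || rc=1

    # Dry run: never stops the instance, only evaluates the permission
    out="$(aws ec2 stop-instances --instance-ids "$instance_id" --dry-run 2>&1 || true)"
    case "$out" in
        *DryRunOperation*) echo "stop-instances: permitted" ;;
        *) echo "stop-instances: NOT permitted or failed"; rc=1 ;;
    esac

    # The IP resource works on the route table, so it must at least be readable
    if aws ec2 describe-route-tables --route-table-ids "$route_table_id" >/dev/null 2>&1; then
        echo "describe-route-tables: permitted"
    else
        echo "describe-route-tables: NOT permitted or failed"
        rc=1
    fi
    return "$rc"
}
```

A non-zero return value indicates that at least one probe failed, which points to the role or policy attachment as the place to look first.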

Putting the cluster into maintenance mode and cleaning up faulty resources

Putting the cluster into maintenance mode is needed whenever we want to perform maintenance activities on the application or the OS.

For any scheduled downtime or maintenance work on the SAP system, such as changes to SAP profile parameters, HANA parameters, or the SAP kernel, SAP support package application, or HANA database upgrades, the cluster should be in maintenance mode to avoid unintended failovers.

7.1 To enable cluster maintenance mode, execute the following command as the root user on any of the cluster nodes:

# crm configure property maintenance-mode="true"

Once the above command is executed, the cluster resources are shown as unmanaged.
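To confirm this before starting maintenance work, the status output can be scanned for unmanaged resources. The sketch below is a minimal example: the function name count_unmanaged is hypothetical, and it assumes that unmanaged resources carry the word "unmanaged" on their line in the `crm status` output, as described above.

```shell
#!/bin/bash
# Sketch: read `crm status` text on stdin, print how many lines mention
# "unmanaged", and return success only if there is at least one.
# Assumption: unmanaged resources are flagged with that word in the output.

count_unmanaged() {
    local n
    n="$(grep -ci 'unmanaged' || true)"   # grep -c prints 0 when nothing matches
    echo "$n"
    [ "$n" -gt 0 ]
}
```

Typical use on a cluster node would be `crm status | count_unmanaged`. Note that a single resource can also be unmanaged on its own, so a non-zero count is a hint rather than proof that cluster-wide maintenance mode is on.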