Operating System - HP-UX

help!!! SGeSAP vicious sapci.cntl /sap.functions scripts?

 
Stefan FLOREA
Occasional Advisor

We have an MC/ServiceGuard (MCSG) 11.14 cluster running SAP 4.6D with Oracle 8.1.7. We have to test the CI package failover when the network goes down. Here comes the problem:

The sapci.cntl stop script calls the test_db function, which in turn executes R3trans to check the connection to the database. But the network is down, so R3trans cannot reach the database package, which runs on the other node.
Instead of returning an error code, R3trans gets stuck and the halt script never exits.
So, when the network is down, the CI package does not fail over!

As the manual states, the transport directory is NFS-exported from the DB package.
It seems that R3trans tries to access /usr/sap/trans but, again, the network is down, so no can do.

Any hint will be highly appreciated.

thanks,
stef
melvyn burnard
Honored Contributor

Re: help!!! SGeSAP vicious sapci.cntl /sap.functions scripts?

I suggest you read the manuals for SGeSAP at:
http://docs.hp.com/hpux/pdf/B7885-90013.pdf
One assumes you are using the NFS Toolkit and have patched everything?
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Romaric Guilloud
Regular Advisor

Re: help!!! SGeSAP vicious sapci.cntl /sap.functions scripts?

Stefan, I integrated our SAP production environment into MC/SG 11.12 with the Serviceguard Extension for SAP.
To prevent this problem, we (HP Consulting and our team) dedicated a VLAN, which we called sap-int, to internal R/3 routines such as the one you described (R3check, CI instance state, etc.).
If you can't do that, a workaround is to lower the network timeout in your cluster.ascii file and to have the network monitored as a resource within your package; this way the package will be freed up for failover when the network goes down.
Got it?
Sincerely,

Romaric.
"And remember: There are no stupid questions; there are only stupid people." (To Homer Simpson, in "The Simpsons".)
Stefan FLOREA
Occasional Advisor

Re: help!!! SGeSAP vicious sapci.cntl /sap.functions scripts?

Hi,

I kind of fixed it by setting the soft option on the NFS mount. The package now stops with NO_RESTART. So I still don't get a failover when the network is down, but on second thought, that seems quite reasonable.
Romaric, I have to admit I didn't get it. Doesn't a resource failure still initiate a failover?

rgds,
stef

Re: help!!! SGeSAP vicious sapci.cntl /sap.functions scripts?

Stef,

Having looked at the code in sap.functions, I can't figure out WHY the CI package needs to test its database connection before proceeding with the shutdown (except to try to shut down cleanly). If the R3trans call does return and the database isn't available, the test_db function doesn't actually DO anything except wait a specified period of time for the database to become available. I question the point of checking for it during shutdown of the CI... That said, you may find that the shutdown stops responding anyway during the calls to stopsap, when the application processes attempt to terminate their database connections.

You *could* just comment out the call to the test_db function in sapci.cntl (it is questionable whether you would still be supported if you did this).

- OR -

you could write your own version of test_db and put it in your customer.functions file, where it will override the version in sap.functions. Possibly you could replace the line:

su - ${SIDADM} -c "eval ${R3TRANS} -d -w /dev/null" >/dev/null 2>&1

with something that starts R3trans in the background and then gives it x seconds to respond before killing it and failing. You would have to figure out a way of getting the return code of R3trans back to the main process, though.
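A minimal sketch of that idea in POSIX shell, assuming the control scripts' shell accepts POSIX syntax. The function name `run_with_timeout` and its arguments are hypothetical and not part of SGeSAP. The point it illustrates is that `wait PID` hands back the exit status of the background command, which is one way to get the R3trans return code back to the main process.

```shell
#!/bin/sh
# run_with_timeout SECONDS COMMAND [ARGS...]
# Hypothetical helper, not part of SGeSAP: runs COMMAND in the background,
# kills it with SIGKILL if it is still running after SECONDS, and returns
# COMMAND's exit status (non-zero if it was killed; the exact value depends
# on the shell). Plain globals are used because POSIX sh has no "local".
run_with_timeout() {
    rwt_limit=$1
    shift
    "$@" &
    rwt_cmd_pid=$!
    # Watchdog: after the limit, kill the command if it is still alive.
    # Its output goes to /dev/null so that a leftover "sleep" cannot hold
    # the caller's stdout/stderr open.
    ( sleep "$rwt_limit"; kill -9 "$rwt_cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    rwt_dog_pid=$!
    # "wait PID" returns the background command's exit status: this is how
    # the return code gets back to the calling process.
    wait "$rwt_cmd_pid"
    rwt_rc=$?
    # The command is gone (finished or killed), so stop the watchdog.
    kill "$rwt_dog_pid" 2>/dev/null
    return "$rwt_rc"
}
```

In your own copy of test_db, the `su - ${SIDADM} ...` line could then become `run_with_timeout 60 su - ${SIDADM} -c "eval ${R3TRANS} -d -w /dev/null" >/dev/null 2>&1`, where the 60-second limit is an arbitrary assumption. Two caveats: killing `su` does not necessarily kill the R3trans process it started, and a process blocked on a hard NFS mount may not die even on SIGKILL until the I/O completes, so this does not guarantee that the halt script returns.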

- OR -

you could simply put a timeout on the halt script in the ci.conf file. Whether this would work, or just end up stopping responding on the NFS mounts, is also questionable...


This is just another example of the endless issues you can have with clustering SAP; if this doesn't get you, then you can bet the NFS-mounted transports will!

HTH

Duncan


Stefan FLOREA
Occasional Advisor

Re: help!!! SGeSAP vicious sapci.cntl /sap.functions scripts?

Hi Duncan,

Thanks for suggestions.
I really questioned the call to test_db in the stop section too. It makes sense when starting the CI, as it has to wait for the DB to come up too, but when going down....
Anyhow, I generally try to avoid messing with the logic of the SGeSAP scripts, as I assume they have been well tested.

Romaric Guilloud
Regular Advisor

Re: help!!! SGeSAP vicious sapci.cntl /sap.functions scripts?

Stefan,

Indeed, a "resource" failure would cause a package to fail over.
Now, what is a "resource"? It's the LANICs defined in your cluster.ascii file (both must fail to trigger a package failover), the SPUs, and any additional resource that you want monitored in your package (e.g. subnet, service, etc.).

The Serviceguard Extension for SAP adds a number of SAP-specific resources; for example, a dw process might not be monitored as a resource by default, whereas startup/shutdown of the DBCI is.

To get going with your test, I recommend temporarily commenting out the test_return_52 functions within your pkg.cntl control scripts (see customer_defined_run_cmds and customer_defined_halt_cmds) to see first how your package behaves with respect to mounting the file systems, exporting the NFS volumes, etc.

Otherwise these "test_return" functions will (as you experienced) exit with NO_RESTART, and your node will not be able to run the package (run "cmviewcl -v | grep -e Primary -e Alternate" to see the switching attribute for the nodes that failed with NO_RESTART).

Then you can explicitly restart a given package on the node you want with:
cmmodpkg -e -n node_name package_name (sets the node switch to ENABLED)
cmrunpkg -v -n node_name package_name, and afterwards issue:
cmmodpkg -e package_name to set the package switching attribute back to ENABLED.
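The three commands above can be wrapped in a small shell function so they always run in the order Romaric gives. This is a sketch: the function name `restart_pkg_on_node` and the `DRY_RUN` switch are hypothetical, and the node and package names passed to it are placeholders for your own.

```shell
#!/bin/sh
# restart_pkg_on_node NODE PACKAGE
# Hypothetical wrapper: enable NODE for PACKAGE, start PACKAGE there,
# then re-enable package switching. With DRY_RUN=1 the commands are only
# printed, not executed.
restart_pkg_on_node() {
    rp_node=$1
    rp_pkg=$2
    rp_run=
    if [ "${DRY_RUN:-0}" = 1 ]; then
        rp_run=echo
    fi
    # Set the node switch for PACKAGE on NODE to ENABLED.
    $rp_run cmmodpkg -e -n "$rp_node" "$rp_pkg" || return 1
    # Start PACKAGE on NODE, verbosely.
    $rp_run cmrunpkg -v -n "$rp_node" "$rp_pkg" || return 1
    # Finally re-enable package switching (same order as in the post).
    $rp_run cmmodpkg -e "$rp_pkg"
}
```

Running it once with `DRY_RUN=1` prints the three commands, which is a cheap way to check the names before touching the cluster.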

Regards,

Romaric.
"And remember: There are no stupid questions; there are only stupid people." (To Homer Simpson, in "The Simpsons".)
Stefan FLOREA
Occasional Advisor

Re: help!!! SGeSAP vicious sapci.cntl /sap.functions scripts?

Hi Romaric,

Thank you for the input.
I know how to avoid the NO_RESTART return.
But I still don't understand the difference between the LAN being monitored in the classic fashion, as an IP SUBNET that the package depends on, and the LAN being monitored as a package resource.
Thanks for the help, Stefan
Romaric Guilloud
Regular Advisor

Re: help!!! SGeSAP vicious sapci.cntl /sap.functions scripts?

Stef,
A LANIC isn't a package resource unless you write specific monitoring scripts for it within your package control script.
Does it answer your question?
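As an illustration of such monitoring scripting, here is a minimal loop that could be run as a package service. The function name, its parameters, and the choice of connectivity test are hypothetical; the thread does not describe the actual scripts.

```shell
#!/bin/sh
# monitor_check INTERVAL MAX_FAILS COMMAND [ARGS...]
# Hypothetical monitoring loop: runs COMMAND every INTERVAL seconds and
# returns 1 as soon as COMMAND has failed MAX_FAILS times in a row.
# A single success resets the failure counter.
monitor_check() {
    mc_interval=$1
    mc_max=$2
    shift 2
    mc_fails=0
    while :; do
        if "$@" >/dev/null 2>&1; then
            mc_fails=0
        else
            mc_fails=$((mc_fails + 1))
            if [ "$mc_fails" -ge "$mc_max" ]; then
                return 1
            fi
        fi
        sleep "$mc_interval"
    done
}
```

For example, `monitor_check 10 3 some_connectivity_test` returns after three consecutive failed checks taken 10 seconds apart. What the test command should be (for instance a ping of a host on the monitored subnet, whose syntax differs between HP-UX and other systems) is left open here. If the loop is started as a service of the package, its exit is what Serviceguard would see as a service failure.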
Regards,

Romaric.

(PS: Please do not forget to drop points to people who answered :-)
"And remember: There are no stupid questions; there are only stupid people." (To Homer Simpson, in "The Simpsons".)
Dietmar Konermann
Honored Contributor

Re: help!!! SGeSAP vicious sapci.cntl /sap.functions scripts?

Some additional thoughts concerning this (quite usual) problem.

The NFS server is a crucial resource in SGeSAP clusters. If you take down the entire network used by NFS (e.g. by pulling all plugs), then this resource becomes unavailable, which in this case in particular is not easy to handle.

The CI package script gets stuck in test_db(), that's right... but without test_db() it would hang later, since it's practically impossible to halt the CI cleanly without NFS available.

OK, you could use soft mounts, but SAP strongly advises against this, especially for /sapmnt/SID.

Essentially, by pulling all plugs you simulate a situation that is not fully covered by your configuration. In fact, you are simulating a double failure (I assume that you have no SPOF in your networking setup).

BTW, a one-package (DBCI) solution is better suited to cope with this situation, and from my point of view it is the better approach in the vast majority of cases. Why not run a "small" CI together with the DB as one package and load the failover node with a "large" application server (AS)? This is easier to maintain and often more robust.

If you need the two-package approach AND want the network outage handled, then IMHO only one really reliable solution is possible... configure a HALT_SCRIPT_TIMEOUT with NODE_FAIL_FAST_ENABLED YES, so that the node on which the package halt hangs gets TOCed.
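As a sketch, the two parameters might appear like this in a legacy-format package ASCII file for the CI package. The value 600 is an assumption; it would need to be longer than your slowest normal CI shutdown, or healthy halts will also TOC the node.

```
# Excerpt of the CI package ASCII file (legacy format); values assumed.
# If the halt script runs longer than HALT_SCRIPT_TIMEOUT seconds, the halt
# is treated as failed; with NODE_FAIL_FAST_ENABLED YES, that failure makes
# Serviceguard TOC the node so that the package can fail over.
NODE_FAIL_FAST_ENABLED    YES
HALT_SCRIPT_TIMEOUT       600
```

The edited file would then be checked and applied with `cmcheckconf -P` and `cmapplyconf -P`.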

Best regards...
Dietmar.
"Logic is the beginning of wisdom; not the end." -- Spock (Star Trek VI: The Undiscovered Country)