
Re: SGLX 12.80 - not resistant to file system overflow /tmp

 
yilmazaydin
Trusted Contributor

SGLX 12.80 - not resistant to file system overflow /tmp

Hello, we are seeing unplanned package stops when the temporary directory /tmp fills up.
Oracle database monitoring failed to complete its task:

Mar 29 15:57:50 root@sglx_node1 tkit_module.sh[6549]: Retrying 3 more time(s) before giving up.
/opt/cmcluster/oracletoolkit/hagetdbstatus.sh: line 33: cannot create temp file for here-document: No space left on device
Mar 29 15:57:54 root@sglx_node1 tkit_module.sh[6549]: Retrying 2 more time(s) before giving up.
/opt/cmcluster/oracletoolkit/halistener.mon: line 50: cannot create temp file for here-document: No space left on device
Mar 29 15:57:57 root@sglx_node1 tkit_module.sh[6544]: Oracle Listener unisvfe failure detected.
Mar 29 15:57:57 root@sglx_node1 tkit_module.sh[6544]: Oracle Listener unisvfe failed
/opt/cmcluster/oracletoolkit/hagetdbstatus.sh: line 33: cannot create temp file for here-document: No space left on device
Mar 29 15:57:58 root@sglx_node1 tkit_module.sh[6549]: Retrying 1 more time(s) before giving up.
Mar 29 15:58:00 root@sglx_node1 tkit_module.sh[6544]: All listeners have failed

I checked the hagetdbstatus.sh script; it uses the following construct:

/usr/local/cmcluster/oracletoolkit/hagetdbstatus.sh: if [[ -f /tmp/ora_error_${SID_NAME}.txt ]] ; then
/usr/local/cmcluster/oracletoolkit/hagetdbstatus.sh: cat /tmp/ora_error_${SID_NAME}.txt

Is this a bug or a feature of the SGLX product?

As I understand it, bash by default creates its temporary files (such as those for here-documents) in /tmp, or in the directory specified by the TMPDIR variable. Either way, if that directory fills up, we would face the same problem: the package stops.
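To illustrate the point about TMPDIR, here is a minimal sketch of a pre-check that tests whether a small temporary file can be created and written in the directory bash would use. The function name `can_write_tmp` and the probe file name are hypothetical; nothing like this is part of the Serviceguard Oracle toolkit as far as this thread shows.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the Serviceguard Oracle toolkit.
# Returns 0 if a small temporary file can be created and written in the
# given directory (default: $TMPDIR, else /tmp), 1 otherwise.
can_write_tmp() {
    local dir="${1:-${TMPDIR:-/tmp}}"
    local probe

    # Creating the file can already fail (missing directory, no free inodes).
    probe=$(mktemp "${dir}/tmpcheck.XXXXXX" 2>/dev/null) || return 1

    # Writing data is what typically fails with "No space left on device".
    if ! printf 'probe\n' > "$probe" 2>/dev/null; then
        rm -f "$probe"
        return 1
    fi

    rm -f "$probe"
    return 0
}
```

A wrapper could call `can_write_tmp || echo "temporary directory is not writable"` before running a check, so that a full /tmp shows up in the log as its own condition instead of as a listener failure.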

 

YA

Mike_Chisholm
HPE Pro
Solution

Re: SGLX 12.80 - not resistant to file system overflow /tmp

I would position this as expected behavior. Serviceguard's primary role is to provide high availability to packaged applications. This means that if the node currently running the application is experiencing a problem of some sort, Serviceguard should fail the package over to one of the other adoptive nodes. So although the monitor may not be explicitly designed to detect and handle a full /tmp file system, I would not say the outcome (failure of the monitor service and failover of the database) is completely undesirable from an HA perspective. A full /tmp file system can certainly destabilize a Linux operating system, leading to problems across many subsystems. Although it might or might not affect Oracle directly, it can affect many other operating system processes, so in my mind this is a situation where a failover is probably desirable.

If /tmp is filling up repeatedly, that should of course be fixed, either by growing it or by figuring out why it keeps happening and stopping whatever is filling it up.
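As a starting point for tracking down what is filling /tmp, the following is a rough sketch using only standard `df`, `du`, `find`, `sort`, and `head`. The function names `fs_used_pct` and `largest_entries` are made up for illustration, and the `find` options assume GNU findutils as shipped on typical Linux distributions.

```shell
#!/usr/bin/env bash
# Illustrative helpers; the names are hypothetical.

# Print the used-space percentage (integer, without the % sign) of the
# file system that holds the given directory (default: /tmp).
fs_used_pct() {
    df -P -- "${1:-/tmp}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# List the N largest entries directly under a directory, sizes in KiB,
# largest first. Defaults: /tmp and 10 entries.
largest_entries() {
    local dir="${1:-/tmp}" count="${2:-10}"
    find "$dir" -mindepth 1 -maxdepth 1 -exec du -x -s -k {} + 2>/dev/null |
        sort -rn | head -n "$count"
}
```

For example, `fs_used_pct /tmp` prints the current usage figure and `largest_entries /tmp 5` shows the five biggest files or directories, which is usually enough to identify the offending process or job.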

I work for HPE.
yilmazaydin
Trusted Contributor

Re: SGLX 12.80 - not resistant to file system overflow /tmp

Hi @Mike_Chisholm

Thank you for a balanced answer. I agree that any problem that can negatively affect the cluster node can also negatively affect the managed application.

 

YA.