topic Re: Service guard cluster failed to halt - corrupted the filesystem in Operating System - Linux

Service guard cluster failed to halt - corrupted the filesystem

skd — Thu, 24 Sep 2009 08:08:15 GMT

Hello Everyone,

We are facing a critical issue during failover.

The cluster failed to halt properly, and the filesystem was corrupted.

The sequence during halt was:
1. Unexporting the NFS filesystems
2. Stopping the NFS service (failed)
3. Unable to unmount the filesystems
4. fsck ran on the still-mounted filesystem and corrupted the data (during cluster startup)
5. The filesystem is gone.

The filesystems are NFS exported.
==============
Aug 14 14:29:08 - Node "stcrm93a": Unexporting filesystem on *:/opt/Car
Aug 14 14:29:08 - Node "stcrm93a": Unexporting filesystem on *:/users
ERROR: sync_rmtab: can't open rmtab_sync file for write
ERROR: sync_rmtab: fail to export the rmtab data
ERROR: Aug 14 14:29:08 - Failed to stop NFS.
ERROR: Function verify_ha_server; Failed to stop HA servers
==================
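
Step 4 above (fsck running against a filesystem that was never unmounted) is the point where data is actually lost, so a guard in front of fsck may be worth considering. The following is only a minimal sketch, not part of the Serviceguard control script: the function names `is_mounted` and `safe_fsck` are hypothetical, and the second argument exists only so the check can be tried against a sample mounts table.

```shell
#!/bin/sh
# Sketch: refuse to fsck anything that is still listed as mounted.
# Function names are hypothetical, not taken from any Serviceguard script.

is_mounted() {
    # $1 = device or mount point to look for
    # $2 = mounts table to read (optional, defaults to /proc/mounts)
    awk -v target="$1" '
        $1 == target || $2 == target { found = 1 }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}

safe_fsck() {
    # $1 = device or mount point, $2 = mounts table (optional)
    if is_mounted "$1" "$2"; then
        echo "refusing to fsck $1: still mounted" >&2
        return 1
    fi
    # Print the command instead of running it, so the sketch has no side effects.
    echo "fsck $1"
}
```

Device names in /proc/mounts can differ from the path used elsewhere (for example a /dev/mapper/... name for an LVM volume), so checking the mount point is likely the more reliable of the two lookups.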

Re: Service guard cluster failed to halt - corrupted the filesystem

skd — Thu, 24 Sep 2009 08:33:26 GMT

We tried the FS_UNMOUNT_COUNT=3 option earlier, but it did not fix the issue.
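
For readers unfamiliar with this kind of setting, a retry count roughly amounts to a loop like the one below. This is a sketch under assumptions, not the control script's actual code: `retry_umount` and `RETRY_DELAY` are hypothetical names, and the optional third argument only exists so the loop can be exercised with a stand-in command. It also shows why retries alone do not help here: if whatever holds the filesystem never lets go, every attempt fails the same way.

```shell
#!/bin/sh
# Sketch of an unmount retry loop (hypothetical helper, not Serviceguard code).
# Usage: retry_umount MOUNTPOINT COUNT [UMOUNT_CMD]

retry_umount() {
    mp=$1
    count=$2
    cmd=${3:-umount}          # stand-in command can be passed for dry testing
    i=1
    while [ "$i" -le "$count" ]; do
        if "$cmd" "$mp"; then
            return 0          # unmounted successfully
        fi
        echo "umount attempt $i of $count failed for $mp" >&2
        i=$((i + 1))
        sleep "${RETRY_DELAY:-1}"
    done
    return 1                  # still busy after COUNT attempts
}
```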

After the cluster failed to halt the package, we tried fuser and umount manually a number of times.

The filesystems that are part of this cluster (/opt/Carmen and /users) are NFS exported, and many clients access them over NFS.

Because the clients are accessing these filesystems over NFS, the filesystems cannot be unmounted.

We simulated this in our test environment (without the cluster) and were able to unmount the filesystem only after:
1. exportfs -ua (unexport), 2. stopping the NFS service, 3. umount

The cluster control script follows the same order: it unexports the filesystems, then stops the NFS services, then unmounts the filesystems. It fails at the step that stops the NFS service.

We suspect that the failure to stop the NFS service is the root cause.
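
The order described above (unexport, stop NFS, unmount) can be sketched as a small script that stops at the first failing step, so that a failed NFS stop is reported instead of being followed by an unmount attempt. This is only an illustration under assumptions: `run` and `halt_nfs_package` are hypothetical names, `service nfs stop` assumes a Red Hat style init script, and dry-run mode (the default) only prints the commands.

```shell
#!/bin/sh
# Sketch of the halt order; hypothetical helpers, not the Serviceguard script.

run() {
    # In dry-run mode (default) print the command, otherwise execute it.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

halt_nfs_package() {
    # Arguments: exported mount points, e.g. /opt/Carmen /users
    for fs in "$@"; do
        run exportfs -u "*:$fs" || return 1
    done
    # Assumption: NFS is stopped through the distribution's init script.
    if ! run service nfs stop; then
        echo "NFS stop failed; not attempting umount" >&2
        return 2
    fi
    for fs in "$@"; do
        run umount "$fs" || return 3
    done
}
```

Calling `halt_nfs_package /opt/Carmen /users` with DRY_RUN left at its default prints the five commands in order without touching the system.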
======
Please find attached the cluster control script and log files.
=======

Please suggest how to resolve the issue.

Re: Service guard cluster failed to halt - corrupted the filesystem

Steven E. Protter — Thu, 24 Sep 2009 11:21:35 GMT

Shalom,

Check the options in the umount man page.

You are probably getting a "device busy" error on the umount. If you use a more forceful option in your Serviceguard (SG) configuration, you can probably kick the users out.

If something like an Oracle database is running on these filesystems, you may need to configure a second package, or add a step to this package, to shut it down immediately as part of the failover process.

SEP

Re: Service guard cluster failed to halt - corrupted the filesystem

skd — Thu, 24 Sep 2009 14:12:42 GMT

Thanks for the update.

Yes, we have tried umount -l.
(-l: Lazy unmount. Detach the filesystem from the filesystem hierarchy now, and cleanup all references to the filesystem as soon as it is not busy anymore. This option allows a busy filesystem to be unmounted.)

But this did not help either.

The filesystem was corrupted after using this option.

We also tried fuser and kill -9 to kill the processes, but the issue remains.

There is no Oracle database here; the filesystems are used over NFS.

Please let me know if you need more details.
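
One possible explanation, not confirmed in this thread: fuser and kill -9 act on user-space processes, while the in-kernel NFS server can keep an exported filesystem busy without appearing as a killable process. A quick check of whether the NFS server threads are still running might look like the sketch below. The function names are hypothetical, and it assumes the nfsd filesystem is mounted at /proc/fs/nfsd; the optional argument is only there to try the function against a sample file.

```shell
#!/bin/sh
# Sketch: report whether the kernel NFS server still has threads running.
# Hypothetical helpers; assumes /proc/fs/nfsd/threads is available.

nfsd_threads() {
    f=${1:-/proc/fs/nfsd/threads}
    if [ -r "$f" ]; then
        cat "$f"        # the file holds the current nfsd thread count
    else
        echo 0          # no nfsd filesystem: treat as not running
    fi
}

nfs_server_running() {
    [ "$(nfsd_threads "$1")" -gt 0 ]
}
```

If `nfs_server_running` still succeeds after the stop step, the umount that follows can be expected to report the filesystem as busy.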