GFS hangs after a couple of days

David Child_1
Honored Contributor

Hello all,

I have set up a new 7-node GFS cluster (details below). After a couple of days, one or more of the three GFS file systems hangs. On some nodes 'df' hangs and on others it works, but an 'ls' of the file system hangs on every node. When the hang occurs, 'cman_tool services' shows nothing unusual:

cman_tool services
type level name id state
fence 0 default 0001000c none
[10 11 12 13 14 18 19]
dlm 1 clvmd 0001000b none
[10 11 12 13 14 18 19]
dlm 1 u01 0002000a none
[10 11 12 13 14 18 19]
dlm 1 u02 0004000a none
[10 11 12 13 14 18 19]
dlm 1 u03 0006000a none
[10 11 12 13 14 18 19]
gfs 2 u01 0001000a none
[10 11 12 13 14 18 19]
gfs 2 u02 0003000a none
[10 11 12 13 14 18 19]
gfs 2 u03 0005000a none
[10 11 12 13 14 18 19]

(Yes, the node IDs are not sequential from 0. It's a long story, so unless it has anything to do with the problem I will not go into it.)
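
For anyone who wants to check this kind of output mechanically, here is a minimal sketch that parses it and flags any group whose state is not 'none' or whose member list differs from the first group's. It assumes the five-column layout shown above and that 'none' means no membership change is pending; the function names are made up for illustration.

```python
SAMPLE = """\
type level name id state
fence 0 default 0001000c none
[10 11 12 13 14 18 19]
dlm 1 clvmd 0001000b none
[10 11 12 13 14 18 19]
dlm 1 u01 0002000a none
[10 11 12 13 14 18 19]
gfs 2 u01 0001000a none
[10 11 12 13 14 18 19]
"""


def parse_services(text):
    """Parse 'cman_tool services' style output into a list of group dicts."""
    groups = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("["):
            # A bracketed line lists the members of the group just above it.
            if groups:
                groups[-1]["members"] = [int(n) for n in line.strip("[]").split()]
            continue
        parts = line.split()
        # Skip the column header and anything that is not a 5-field group line.
        if len(parts) != 5 or parts[0] == "type":
            continue
        gtype, level, name, gid, state = parts
        groups.append({"type": gtype, "level": int(level), "name": name,
                       "id": gid, "state": state, "members": []})
    return groups


def find_problems(groups):
    """Return human-readable notes for groups that look inconsistent."""
    problems = []
    if not groups:
        return problems
    expected = set(groups[0]["members"])
    for g in groups:
        label = "%s/%s" % (g["type"], g["name"])
        if g["state"] != "none":
            problems.append("%s is in state %s" % (label, g["state"]))
        if set(g["members"]) != expected:
            problems.append("%s has a different member list" % label)
    return problems


if __name__ == "__main__":
    for note in find_problems(parse_services(SAMPLE)) or ["no inconsistencies found"]:
        print(note)
```

A clean result only says the group membership looks consistent. It does not rule out a lock or I/O problem underneath.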

The only out-of-the-ordinary entries in the messages file are:

Nov 5 23:48:47 server10 openais[5292]: [TOTEM] Retransmit List: ad0
Nov 5 23:48:47 server10 openais[5292]: [TOTEM] Retransmit List: ae3
Nov 5 23:48:47 server10 openais[5292]: [TOTEM] Retransmit List: ae3
Nov 5 23:48:48 server10 openais[5292]: [TOTEM] Retransmit List: aff
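
To see whether these retransmits cluster around the time of a hang, a small script can count them per host and per minute. This is only a sketch: it assumes the syslog line format shown above, and the regular expression and function name are illustrative.

```python
import re
from collections import Counter

# Matches lines like:
#   Nov 5 23:48:47 server10 openais[5292]: [TOTEM] Retransmit List: ad0
RETRANSMIT = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+"
    r"(?P<host>\S+)\s+openais\[\d+\]:\s+\[TOTEM\]\s+Retransmit List:"
)


def retransmits_per_minute(lines):
    """Count TOTEM retransmit messages keyed by (host, 'Mon D HH:MM')."""
    counts = Counter()
    for line in lines:
        match = RETRANSMIT.match(line.strip())
        if match:
            # Collapse the double space syslog uses for single-digit days.
            minute = " ".join(match.group("stamp").split())
            counts[(match.group("host"), minute)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Nov 5 23:48:47 server10 openais[5292]: [TOTEM] Retransmit List: ad0",
        "Nov 5 23:48:47 server10 openais[5292]: [TOTEM] Retransmit List: ae3",
        "Nov 5 23:48:48 server10 openais[5292]: [TOTEM] Retransmit List: aff",
    ]
    for (host, minute), n in sorted(retransmits_per_minute(sample).items()):
        print(host, minute, n)
```

In practice the lines would come from the messages file on each node, so the per-minute counts can be compared against the time the file systems stopped responding.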

I tried running 'gfs_tool' with various options on the affected file system(s), but those commands hang as well. I then ran strace on 'df', and it hangs in a stat call:

15925 stat("/u01",

When it's working correctly, it looks more like this:

15877 stat("/u01", {st_mode=S_IFDIR|0775, st_size=3864, ...}) = 0
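
Since a stat of the mount point blocks when the file system is hung, one way to test a mount without tying up the shell is to do the stat in a child process and give up after a timeout. The following is a rough sketch, assuming a Python 3 interpreter is available on the node; `probe_path` and the 5-second default are arbitrary choices, not part of any cluster tool.

```python
import subprocess
import sys


def probe_path(path, timeout=5.0):
    """stat() a path in a child process; return 'ok', 'error', or 'hung'."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked inside the kernel may ignore
        # the kill, so do not wait for it again.
        child.kill()
        return "hung"
    return "ok" if returncode == 0 else "error"


if __name__ == "__main__":
    for mount in sys.argv[1:]:
        print(mount, probe_path(mount))
```

Run from cron on each node with the mount points as arguments (for example /u01 /u02 /u03), this would at least record when each node first sees the hang.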

This is a brand new cluster that has been up for less than a week, so the hang has only happened twice. Unfortunately, to keep the project going, I have been unable to leave it in the locked-up state for any length of time. Each time I have had to reboot the entire cluster and remount everything so the application team could continue their work. So far I can see no pattern as to when it locks up.

Environment:
--------------
Hardware:
Two HP c-class chassis
HP Virtual Connect for network (10g uplink using VLAN tagging in a shared uplink set)
HP Virtual Connect for SAN connectivity (connected to EMC Symmetrix)
Seven HP BL480c blade servers (5 servers in chassis0 and 2 in chassis1)

Software:
RHEL 5.2
Native multipathing
NIC bonding on the cluster interconnect

Any ideas?

Thanks,
David

1 REPLY
Steven E. Protter
Exalted Contributor

Re: GFS hangs after a couple of days

Shalom,

Use RHN/yum to update to the latest versions.

Make sure there is only one version of the gfs-kernel package installed.
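
A quick way to check for duplicate installs is to ask rpm for every installed instance of the package and count the distinct versions. This is a sketch only: the package name is the one mentioned above and may be different on your release, the helper names are made up, and the version strings in the comments are placeholders rather than real package versions.

```python
import subprocess


def installed_versions(rpm_output):
    """Return the distinct version strings found in 'rpm -q --queryformat' output."""
    versions = []
    for raw in rpm_output.splitlines():
        line = raw.strip()
        # rpm prints 'package NAME is not installed' when nothing matches.
        if not line or line.endswith("is not installed"):
            continue
        if line not in versions:
            versions.append(line)
    return versions


def query_package(name):
    """Run rpm for one package name and return its installed versions."""
    result = subprocess.run(
        ["rpm", "-q", "--queryformat", "%{VERSION}-%{RELEASE}.%{ARCH}\n", name],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return installed_versions(result.stdout)


if __name__ == "__main__":
    found = query_package("gfs-kernel")  # name assumed from this reply
    if len(found) > 1:
        print("multiple versions installed:", ", ".join(found))
    else:
        print("installed versions:", found)
```

Running the same check on every node also shows whether the nodes are at different package levels.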

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com