
Red Hat AS3 Update 6 Cluster Suite

 
Mike Hedderly
Advisor

I am running Red Hat Cluster Suite on two DL380 servers. When I run clustat, one member remains fairly stable, but the second member switches between active and inactive every 5 seconds or so. I have bonded my two NICs together and enabled an 802.1Q trunk, so my cluster traffic is on bond0.13. I can always ping each cluster member. This is the error I get in the messages file:

Jan 20 15:49:56 ralph clusvcmgrd[4311]: State change: huey-c UP
Jan 20 15:49:57 ralph clumembd[4144]: Member huey-c DOWN
Jan 20 15:49:58 ralph clumembd[4144]: Membership View #7350:0x00000001
Jan 20 15:49:59 ralph cluquorumd[4119]: --> Commencing STONITH <--
Jan 20 15:49:59 ralph cluquorumd[4119]: STONITH: Falsely claiming that huey-c has been fenced
Jan 20 15:49:59 ralph cluquorumd[4119]: STONITH: Data integrity may be compromised!
Jan 20 15:50:00 ralph clusvcmgrd[4311]: Quorum Event: View #12657 0x00000001
Jan 20 15:50:00 ralph clusvcmgrd[4311]: State change: huey-c DOWN
Jan 20 15:50:08 ralph clumembd[4144]: Member huey-c UP
Jan 20 15:50:12 ralph clumembd[4144]: Member huey-c DOWN
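
To put a number on how often a member flaps, the clumembd membership lines in the messages file can be counted per member. The following is only a minimal sketch: it assumes syslog lines shaped exactly like the excerpt above, and the function name `membership_events` is a hypothetical helper, not part of clumanager.

```python
import re
from collections import Counter

# Matches clumembd membership lines as they appear in the excerpt, e.g.
# "Jan 20 15:49:57 ralph clumembd[4144]: Member huey-c DOWN"
MEMBER_RE = re.compile(r"clumembd\[\d+\]: Member (\S+) (UP|DOWN)")


def membership_events(log_text):
    """Count UP/DOWN membership events per member name (hypothetical helper)."""
    return Counter(match.group(1) for match in MEMBER_RE.finditer(log_text))


# Three clumembd lines copied from the excerpt above.
SAMPLE = """\
Jan 20 15:49:57 ralph clumembd[4144]: Member huey-c DOWN
Jan 20 15:50:08 ralph clumembd[4144]: Member huey-c UP
Jan 20 15:50:12 ralph clumembd[4144]: Member huey-c DOWN
"""

if __name__ == "__main__":
    for member, count in membership_events(SAMPLE).items():
        print(member, count)
```

A member that logs several such events within a few seconds is being dropped and re-admitted repeatedly, which matches the behaviour clustat shows.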


Mike Hedderly
Advisor

Re: Red Hat AS3 Update 6 Cluster Suite

Some further information: I am running clumanager-1.2.28-1 and redhat-config-cluster-1.0.8-1.

I do not have any power switches, and the external disks are on an MSA1000 via Fibre Channel.
Steven E. Protter
Exalted Contributor

Re: Red Hat AS3 Update 6 Cluster Suite

Shalom Mike,

I don't think you've fully configured the cluster.

STONITH: Falsely claiming that huey-c has been fenced

Shoot
The
Other
Node
In
The
Head

It's trying to shut down the other node because it thinks it's down or there is a risk of data corruption.

Checklist:
MSA1000 firmware is up to date.
The SANsurfer package is installed on both servers to check the state of shared storage.
Shared storage is configured so the sd# devices are the same on both nodes.
Firmware on the QLogic cards is the same on all cards and all servers, and reasonably up to date.
Cluster configuration files.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Vitaly Karasik_1
Honored Contributor

Re: Red Hat AS3 Update 6 Cluster Suite

I suggest you re-check the cluster configuration against the Red Hat documentation (Chapter 3): http://www.redhat.com/docs/manuals/enterprise/RHEL-3-Manual/cluster-suite/ch-software.html
Rgds,
Vitaly
John McNulty_2
Frequent Advisor

Re: Red Hat AS3 Update 6 Cluster Suite


Thanks for the advice, but we found the problem. The STONITH errors were a red herring. This cluster has no power switches, so it's not possible to STONITH a node that the cluster perceives to have changed to a "down" state.

The cause of "huey" dropping in and out of the cluster every few seconds turned out to be a clash between two Red Hat clusters on the same network that were both using the 225.0.0.11 multicast address. We changed the multicast address to a unique one, reloaded the config, and restarted the cluster, and the problem has gone away. The cluster is stable now.

It would have been nice for Red Hat to have reported this somewhere. We only discovered what was going on after pinging the multicast address and seeing more DUP responses than we expected, from IP addresses belonging to the other Red Hat cluster.
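
For anyone repeating that check, the ping output can be compared against the list of addresses that are supposed to be in the cluster. This is a minimal sketch, not a definitive tool: it assumes reply lines in the usual Linux ping format ("64 bytes from <address>: ..."), the helper name `unexpected_responders` is hypothetical, and every IP address in the sample is invented for illustration.

```python
import re

# Matches reply lines as printed by Linux ping, e.g.
# "64 bytes from 10.0.13.2: icmp_seq=1 ttl=64 time=0.21 ms (DUP!)"
REPLY_RE = re.compile(r"bytes from (\d+\.\d+\.\d+\.\d+):")


def unexpected_responders(ping_output, expected_members):
    """Return addresses that answered the ping but are not expected members."""
    responders = set(REPLY_RE.findall(ping_output))
    return responders - set(expected_members)


# Hypothetical captured output from pinging the cluster's multicast address.
# All addresses are made up for this example.
SAMPLE = """\
64 bytes from 10.0.13.1: icmp_seq=1 ttl=64 time=0.05 ms
64 bytes from 10.0.13.2: icmp_seq=1 ttl=64 time=0.21 ms (DUP!)
64 bytes from 10.0.20.5: icmp_seq=1 ttl=64 time=0.34 ms (DUP!)
64 bytes from 10.0.20.6: icmp_seq=1 ttl=64 time=0.40 ms (DUP!)
"""

if __name__ == "__main__":
    own_members = ["10.0.13.1", "10.0.13.2"]  # assumed addresses of this cluster
    print(sorted(unexpected_responders(SAMPLE, own_members)))
```

Any address left in the result belongs to a host outside the cluster that is listening on the same multicast group, which is the situation described above.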