<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: redhat 6.2 cluster in Operating System - Linux</title>
    <link>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5877849#M53969</link>
    <description>&lt;P&gt;RHEL 6.1 and newer versions have a multicast testing tool named "omping" included in the distribution.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This RHKB article includes a python script that can be used to test multicast on any system that has a python interpreter installed:&lt;/P&gt;&lt;P&gt;&lt;A href="https://access.redhat.com/knowledge/solutions/116553" target="_blank"&gt;https://access.redhat.com/knowledge/solutions/116553&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;According to the latest information, using a crossover cable is possible for testing but not recommended in production:&lt;/P&gt;&lt;P&gt;&lt;A href="https://access.redhat.com/knowledge/solutions/151203" target="_blank"&gt;https://access.redhat.com/knowledge/solutions/151203&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;FONT size="2" face="arial,helvetica,sans-serif"&gt;Cluster, High Availability, and GFS Deployment Best Practices:&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://access.redhat.com/knowledge/articles/40051" target="_blank"&gt;https://access.redhat.com/knowledge/articles/40051&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Remember that if you have a two-node cluster with no quorum disk, the vote calculation does not happen at all (two-node mode).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The cluster quorum vote calculation is very simple: if the cluster (or a part of a cluster that has lost heartbeat connection to the rest of the cluster) has more than 50% of total expected votes, the cluster has quorum and is allowed to act (i.e. accept new nodes, fence unresponsive nodes, failover services from unresponsive nodes). If there are not enough votes, there is no quorum. If there is no quorum, a cluster cannot form, and an existing cluster cannot perform any failover actions. 
If the part of the cluster that lost quorum was running any services, it may keep running them unless it gets fenced or gets an eviction notice through the quorum disk.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;(If the quorum is lost because all the other nodes are down, keeping the currently-running services running is the best the node can do until an admin can confirm the other nodes are in fact down. If the other nodes are OK but unreachable because of a fault in the heartbeat network, the other nodes will attempt to fence the isolated node and will take over its services if fencing is successful. If the isolated node reboots, it cannot rejoin the cluster because it has no quorum alone.)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;When a quorum disk is used,&amp;nbsp; it has an entirely separate logic for deciding when &amp;amp; where to grant the quorum disk votes.&lt;/P&gt;&lt;P&gt;The quorum disk daemon in each node will use the configured commands ("heuristics") to determine if the node is in a serviceable condition or not. If the node is in good condition, it will update its message slot on the shared quorum disk with a timestamp, in effect saying "I'm alive". The quorum disk daemon on the "alive" node with the smallest node ID will become the "master" quorum disk daemon: this node will grant the quorum disk votes. The extra votes will have an effect in the main quorum voting: this is the only connection between the quorum disk daemon and the main cluster quorum voting logic.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Increasing the number of qdisk votes is not really necessary in a two-node cluster: the optimum number of votes the quorum disk needs to have to achieve "last-man-standing" configuration is always the number of nodes minus 1. 
In a two-node cluster, this value is 1.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;(Last-man-standing: even if all the other nodes fail, the single remaining node can keep running and maintain quorum alone. It can even reboot and restart cluster services alone without manual intervention, if necessary.)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;With the optimum number of quorum disk votes, you can shut down the quorum disk daemons and the cluster will still have quorum: this is useful if you need to reconfigure your quorum disk while the cluster is running. It will also make the quorum disk failure a non-critical event for the cluster.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;If you have more than the optimum number of quorum disk votes, you will lose cluster quorum if the quorum disk fails. This can make the quorum disk a single point of failure in your cluster.&lt;/P&gt;</description>
    <pubDate>Sun, 25 Nov 2012 11:15:41 GMT</pubDate>
    <dc:creator>Matti_Kurkela</dc:creator>
    <dc:date>2012-11-25T11:15:41Z</dc:date>
    <item>
      <title>redhat 6.2 cluster</title>
      <link>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5877343#M53964</link>
      <description>&lt;P&gt;Hello Experts,&lt;BR /&gt;&lt;BR /&gt;Please help me with the exact steps for configuring a two-node cluster on RHEL 6.2.&lt;BR /&gt;I failed to configure even the simplest cluster with the steps below:&lt;BR /&gt;&lt;BR /&gt;1- install RHEL 6.2 64-bit on both nodes&lt;BR /&gt;2- add the following to the hosts file of each server (one IP in the local network and another in the private network):&lt;BR /&gt;x.x.x.x&amp;nbsp; node1-pub&lt;BR /&gt;z.z.z.z&amp;nbsp; node2-pub&lt;BR /&gt;y.y.y.y&amp;nbsp; node1-pvt&lt;BR /&gt;t.t.t.t&amp;nbsp; node2-pvt&lt;BR /&gt;&lt;BR /&gt;3- yum install ricci (on both nodes)&lt;BR /&gt;4- yum install luci (on one node)&lt;BR /&gt;5- yum groupinstall "high availability" (on both nodes)&lt;BR /&gt;6- from a browser, open node1-pub:8084 (log in and create a new cluster);&lt;BR /&gt;give the cluster name; the node names are (node1-pvt), (node2-pvt)&lt;BR /&gt;7- the cluster is up with two nodes, so far.&lt;BR /&gt;========================================================&lt;BR /&gt;Now:&lt;BR /&gt;8- configure a failover domain and select both nodes.&lt;BR /&gt;9- configure an IP resource and give an IP in the same range as the public network.&lt;BR /&gt;10- configure a service group and assign the failover domain and the IP resource to it.&lt;BR /&gt;11- the IP doesn't start.&lt;BR /&gt;&lt;BR /&gt;==========&lt;BR /&gt;Nov 24 02:59:37 rgmanager start on ip "10.10.4.223/255.255.255.0" returned 1 (generic error)&lt;BR /&gt;Nov 24 02:59:37 rgmanager #68: Failed to start service:vip; return value: 1&lt;BR /&gt;Nov 24 02:59:37 rgmanager Stopping service service:vip&lt;BR /&gt;Nov 24 02:59:37 rgmanager [ip] 10.10.4.223/255.255.255.0 is not configured&lt;BR /&gt;Nov 24 02:59:37 rgmanager Service service:vip is recovering&lt;BR /&gt;Nov 24 02:59:38 rgmanager #71: Relocating failed service service:vip&lt;BR /&gt;==========&lt;BR /&gt;From luci I get this error:&lt;BR /&gt;&lt;BR /&gt;Starting cluster "cluname" service "vip" from node "node1-pvt" failed: vip is in unknown state 118&lt;BR /&gt;&lt;BR /&gt;What did I miss?&lt;BR /&gt;&lt;BR /&gt;Please help.&lt;BR /&gt;&lt;BR /&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Sat, 24 Nov 2012 08:20:51 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5877343#M53964</guid>
      <dc:creator>karmellove</dc:creator>
      <dc:date>2012-11-24T08:20:51Z</dc:date>
    </item>
    <item>
      <title>Re: redhat 6.2 cluster</title>
      <link>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5877373#M53965</link>
      <description>&lt;P&gt;Please run these commands:&lt;/P&gt;&lt;PRE&gt;# rg_test test /etc/cluster/cluster.conf start service vip&lt;BR /&gt;# rg_test test /etc/cluster/cluster.conf stop service vip&lt;/PRE&gt;&lt;P&gt;What is the output?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;According to RedHat documentation, if you use a netmask in an IP resource configuration, you should only specify the netmask length, not the fully-spelled-out netmask. In other words, use "10.10.4.223/24" instead of "10.10.4.223/255.255.255.0".&lt;/P&gt;</description>
      <pubDate>Sat, 24 Nov 2012 08:54:56 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5877373#M53965</guid>
      <dc:creator>Matti_Kurkela</dc:creator>
      <dc:date>2012-11-24T08:54:56Z</dc:date>
    </item>
    <item>
      <title>Re: redhat 6.2 cluster</title>
      <link>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5877545#M53966</link>
      <description>&lt;P&gt;I modified the configuration to use /24 instead of 255.255.255.0 and rebooted both nodes; now the IP is up on the first node.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Below is the output of the rg_test command:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;rg_test test /etc/cluster/cluster.conf
Running in test mode.
Loading resource rule from /usr/share/cluster/named.sh
Loading resource rule from /usr/share/cluster/oralistener.sh
Loading resource rule from /usr/share/cluster/postgres-8.sh
Loading resource rule from /usr/share/cluster/ip.sh
Loading resource rule from /usr/share/cluster/vm.sh
Loading resource rule from /usr/share/cluster/lvm_by_vg.sh
Loading resource rule from /usr/share/cluster/orainstance.sh
Loading resource rule from /usr/share/cluster/samba.sh
Loading resource rule from /usr/share/cluster/fence_scsi_check.pl
Loading resource rule from /usr/share/cluster/fs.sh
Loading resource rule from /usr/share/cluster/nfsserver.sh
Loading resource rule from /usr/share/cluster/ocf-shellfuncs
Loading resource rule from /usr/share/cluster/checkquorum
Loading resource rule from /usr/share/cluster/SAPInstance
Loading resource rule from /usr/share/cluster/service.sh
Loading resource rule from /usr/share/cluster/SAPDatabase
Loading resource rule from /usr/share/cluster/nfsexport.sh
Loading resource rule from /usr/share/cluster/svclib_nfslock
Loading resource rule from /usr/share/cluster/script.sh
Loading resource rule from /usr/share/cluster/tomcat-6.sh
Loading resource rule from /usr/share/cluster/clusterfs.sh
Loading resource rule from /usr/share/cluster/openldap.sh
Loading resource rule from /usr/share/cluster/apache.sh
Loading resource rule from /usr/share/cluster/mysql.sh
Loading resource rule from /usr/share/cluster/ASEHAagent.sh
Loading resource rule from /usr/share/cluster/lvm.sh
Loading resource rule from /usr/share/cluster/netfs.sh
Loading resource rule from /usr/share/cluster/lvm_by_lv.sh
Loading resource rule from /usr/share/cluster/oracledb.sh
Loading resource rule from /usr/share/cluster/nfsclient.sh
Loaded 24 resource rules
=== Resources List ===
Resource type: ip
Instances: 1/1
Agent: ip.sh
Attributes:
  address = 10.10.4.223/24 [ primary unique ]
  monitor_link = on
  nfslock [ inherit("service%nfslock") ]
  sleeptime = 10

Resource type: service [INLINE]
Instances: 1/1
Agent: service.sh
Attributes:
  name = vip [ primary unique required ]
  domain = Active_Passive [ reconfig ]
  autostart = 1 [ reconfig ]
  exclusive = 0 [ reconfig ]
  nfslock = 0
  nfs_client_cache = 0
  recovery = relocate [ reconfig ]
  depend_mode = hard
  max_restarts = 0
  restart_expire_time = 0
  priority = 0

=== Resource Tree ===
service (S0) {
  name = "vip";
  domain = "Active_Passive";
  autostart = "1";
  exclusive = "0";
  nfslock = "0";
  nfs_client_cache = "0";
  recovery = "relocate";
  depend_mode = "hard";
  max_restarts = "0";
  restart_expire_time = "0";
  priority = "0";
  ip (S0) {
    address = "10.10.4.223/24";
    monitor_link = "on";
    nfslock = "0";
    sleeptime = "10";
  }
}
=== Failover Domains ===
Failover domain: Active_Passive
Flags: none
  Node node1 (id 1, priority 0)
  Node node2 (id 2, priority 0)
=== Event Triggers ===
Event Priority Level 100:
  Name: Default
    (Any event)
    File: /usr/share/cluster/default_event_script.sl
&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Now, I haven't configured any fencing so far. The VIP was running on the first node, and I powered that node off. The cluster has detected that it is off, but the VIP is still on the first node (which is off) and does not move to the second node:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;network-scripts]# clustat
Cluster Status for cluster @ Sat Nov 24 23:13:39 2012
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 node1                                 1 Offline
 node2                                 2 Online, Local, rgmanager

 Service Name                Owner (Last)                State
 ------- ----                ----- ------                -----
 service:vip                 node1                       started
&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here is my cluster.conf:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&amp;lt;?xml version="1.0"?&amp;gt;
&amp;lt;cluster config_version="10" name="cluster"&amp;gt;
        &amp;lt;clusternodes&amp;gt;
                &amp;lt;clusternode name="node1" nodeid="1"/&amp;gt;
                &amp;lt;clusternode name="node2" nodeid="2"/&amp;gt;
        &amp;lt;/clusternodes&amp;gt;
        &amp;lt;cman expected_votes="1" two_node="1"/&amp;gt;
        &amp;lt;rm&amp;gt;
                &amp;lt;failoverdomains&amp;gt;
                        &amp;lt;failoverdomain name="Active_Passive" nofailback="0" ordered="0" restricted="0"&amp;gt;
                                &amp;lt;failoverdomainnode name="node1"/&amp;gt;
                                &amp;lt;failoverdomainnode name="node2"/&amp;gt;
                        &amp;lt;/failoverdomain&amp;gt;
                &amp;lt;/failoverdomains&amp;gt;
                &amp;lt;resources&amp;gt;
                        &amp;lt;ip address="10.10.4.223/24" monitor_link="on" sleeptime="10"/&amp;gt;
                &amp;lt;/resources&amp;gt;
                &amp;lt;service domain="Active_Passive" name="vip" recovery="relocate"&amp;gt;
                        &amp;lt;ip ref="10.10.4.223/24"/&amp;gt;
                &amp;lt;/service&amp;gt;
        &amp;lt;/rm&amp;gt;
&amp;lt;/cluster&amp;gt;
&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any clue about this? Is it because I didn't configure fencing?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Also, when the first node comes back up, it will not join the existing cluster; it will just start another one, considering the second node to be off.&lt;/P&gt;&lt;P&gt;How can I overcome this?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Sat, 24 Nov 2012 17:56:16 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5877545#M53966</guid>
      <dc:creator>karmellove</dc:creator>
      <dc:date>2012-11-24T17:56:16Z</dc:date>
    </item>
    <item>
      <title>Re: redhat 6.2 cluster</title>
      <link>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5877635#M53967</link>
      <description>&lt;P&gt;The IP configuration looks good now.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;gt; Now, i haven't configured any fencing so far, the VIP is running on the first node, now i powered it off, cluster has&lt;/P&gt;&lt;P&gt;&amp;gt; detected that it is off, but the vip is still on the first node which is off, and it doesn't move to first node,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This is exactly because you have not configured fencing yet. Fencing is a mandatory part of a functioning RedHat Cluster configuration. When a node becomes unresponsive, service failover can happen only after the fencing agent confirms that the unresponsive node has been &lt;EM&gt;successfully &lt;/EM&gt;fenced. If there is no fencing configured, this step cannot complete, and the cluster cannot perform any failover operations.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;gt; when the first node comes up, it will not join the cluster, it will just start another one considering the second node is off.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This indicates a problem in cluster heartbeat delivery. Remember that the RedHat cluster heartbeat is multicast-based.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This Cisco document describes a known issue in some (many?) Cisco switches regarding multicasts and IGMP snooping, and lists five ways to fix or work around it.&lt;/P&gt;&lt;P&gt;&lt;A href="http://www.cisco.com/en/US/products/hw/switches/ps708/products_tech_note09186a008059a9df.shtml" target="_blank"&gt;http://www.cisco.com/en/US/products/hw/switches/ps708/products_tech_note09186a008059a9df.shtml&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I've seen this problem happen when the nodes are connected to different switches (i.e. located in different server rooms, separated by at least a fire-resistant wall). When both nodes join the cluster simultaneously, they both send IGMP messages to join the appropriate multicast group. 
The switches broadcast the IGMP messages since the multicast group is not known to them yet, so both switches become aware (by IGMP snooping) that the multicast group traffic needs to be delivered from one switch to the other. So far, so good.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;But when a node goes down, its multicast group membership times out. At that point, the switches notice that there is only one node that is interested in that multicast traffic, so the multicast traffic delivery between the switches is stopped. But when the failed node comes back up, IGMP snooping erroneously does not re-activate the multicast delivery between the switches, and as a result, each node thinks it is alone and you have a "split-brain" situation.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;With the Cisco IGMP snooping implementation, this happens when there is no source of IGMP group membership queries in the network segment. Such a source can be a real multicast router, or an "IGMP querier": a device that sends IGMP queries exactly like a multicast router, but does not actually route any multicasts at all. It allows Cisco IGMP snooping to work correctly within a network segment.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You'll need to contact your network administrators to solve this issue. Show them the Cisco link (above) and request that they check for similar multicast behavior in the network segment that is used for your cluster heartbeat.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You'll need to get this fixed before configuring fencing: if you configure fencing while this problem is still in effect, a heartbeat network failure &lt;EM&gt;will&lt;/EM&gt; trigger a see-saw effect: node A sees node B become unresponsive, so it fences node B. If power fencing is used, this causes node B to crash and reboot... at which point it will establish a new cluster, see that node A is unresponsive, and will fence node A in turn. 
This cycle can repeat indefinitely.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;(You could stop the see-saw effect by either setting up a quorum disk, or by configuring fencing so that a fenced node will not reboot but will stay down until manually restored instead. But these are both mitigating the symptoms instead of fixing the root cause.)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have encountered this multicast issue often enough that I will always test the multicast functionality before even starting to set up a RedHat Cluster.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This RedHat Knowledge Base article is an index of cluster-related RHKB articles. Those articles contain a lot of information that is hard to find elsewhere. (RedHat Network access required)&lt;/P&gt;&lt;P&gt;&lt;A href="https://access.redhat.com/knowledge/articles/47987" target="_blank"&gt;https://access.redhat.com/knowledge/articles/47987&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 25 Nov 2012 00:02:01 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5877635#M53967</guid>
      <dc:creator>Matti_Kurkela</dc:creator>
      <dc:date>2012-11-25T00:02:01Z</dc:date>
    </item>
    <item>
      <title>Re: redhat 6.2 cluster</title>
      <link>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5877767#M53968</link>
      <description>&lt;P&gt;Thanks for your detailed response, that was really helpful.&lt;BR /&gt;&lt;BR /&gt;Is there a tool in Linux to test multicast?&lt;BR /&gt;Can I use a crossover cable for the heartbeat between the nodes?&lt;BR /&gt;Are there any best practices for the heartbeat setup and IP configuration?&lt;BR /&gt;&lt;BR /&gt;I will use a quorum disk with 3 votes, and 1 vote for each node. Could you please check the cluster.conf below? Am I getting it right?&lt;BR /&gt;I am a little bit confused by the concepts Red Hat uses for clustering; I am an HP-UX fanatic.&lt;BR /&gt;Now for the vote calculation: I set the cman expected votes to 5 (3 for the quorum disk + 1 for each node = 5). How does the cluster calculate the votes needed to come up?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&amp;lt;?xml version="1.0"?&amp;gt;
&amp;lt;cluster config_version="20" name="cluster"&amp;gt;
        &amp;lt;clusternodes&amp;gt;
                &amp;lt;clusternode name="node1" nodeid="1"/&amp;gt;
                &amp;lt;clusternode name="node2" nodeid="2"/&amp;gt;
        &amp;lt;/clusternodes&amp;gt;
        &amp;lt;cman expected_votes="5"/&amp;gt;
        &amp;lt;rm&amp;gt;
                &amp;lt;failoverdomains&amp;gt;
                        &amp;lt;failoverdomain name="Active_Passive" nofailback="0" ordered="0" restricted="0"&amp;gt;
                                &amp;lt;failoverdomainnode name="node1"/&amp;gt;
                                &amp;lt;failoverdomainnode name="node2"/&amp;gt;
                        &amp;lt;/failoverdomain&amp;gt;
                &amp;lt;/failoverdomains&amp;gt;
                &amp;lt;resources&amp;gt;
                        &amp;lt;ip address="10.10.4.223/24" monitor_link="on" sleeptime="10"/&amp;gt;
                &amp;lt;/resources&amp;gt;
                &amp;lt;service domain="Active_Passive" name="vip" recovery="relocate"&amp;gt;
                        &amp;lt;ip ref="10.10.4.223/24"/&amp;gt;
                &amp;lt;/service&amp;gt;
        &amp;lt;/rm&amp;gt;
        &amp;lt;quorumd device="/dev/sdb" min_score="1" votes="3"&amp;gt;
                &amp;lt;heuristic program="ping -c1 -t1 Gateway_IP"/&amp;gt;
        &amp;lt;/quorumd&amp;gt;
&amp;lt;/cluster&amp;gt;
&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;/etc/hosts:&lt;/P&gt;&lt;PRE&gt;10.10.4.221     server1.example.com server1
10.10.4.222     server2.example.com server2
172.16.26.1     node1.example.com node1
172.16.26.2     node2.example.com node2
10.10.4.223     vip.example.com vip
&lt;/PRE&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 25 Nov 2012 06:14:12 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5877767#M53968</guid>
      <dc:creator>karmellove</dc:creator>
      <dc:date>2012-11-25T06:14:12Z</dc:date>
    </item>
    <item>
      <title>Re: redhat 6.2 cluster</title>
      <link>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5877849#M53969</link>
      <description>&lt;P&gt;RHEL 6.1 and newer versions include a multicast testing tool named "omping" in the distribution.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This RHKB article includes a Python script that can be used to test multicast on any system with a Python interpreter installed:&lt;/P&gt;&lt;P&gt;&lt;A href="https://access.redhat.com/knowledge/solutions/116553" target="_blank"&gt;https://access.redhat.com/knowledge/solutions/116553&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;According to the latest information, using a crossover cable is possible for testing but not recommended in production:&lt;/P&gt;&lt;P&gt;&lt;A href="https://access.redhat.com/knowledge/solutions/151203" target="_blank"&gt;https://access.redhat.com/knowledge/solutions/151203&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;FONT size="2" face="arial,helvetica,sans-serif"&gt;Cluster, High Availability, and GFS Deployment Best Practices:&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://access.redhat.com/knowledge/articles/40051" target="_blank"&gt;https://access.redhat.com/knowledge/articles/40051&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Remember that if you have a two-node cluster with no quorum disk, the vote calculation does not happen at all (two-node mode).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The cluster quorum vote calculation is very simple: if the cluster (or a part of a cluster that has lost its heartbeat connection to the rest of the cluster) has more than 50% of the total expected votes, it has quorum and is allowed to act (i.e. accept new nodes, fence unresponsive nodes, and fail over services from unresponsive nodes). If there are not enough votes, there is no quorum. Without quorum, a cluster cannot form, and an existing cluster cannot perform any failover actions. 
If the part of the cluster that lost quorum was running any services, it may keep running them unless it gets fenced or receives an eviction notice through the quorum disk.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;(If quorum is lost because all the other nodes are down, keeping the currently running services running is the best the node can do until an admin can confirm the other nodes are in fact down. If the other nodes are OK but unreachable because of a fault in the heartbeat network, they will attempt to fence the isolated node and will take over its services if fencing is successful. If the isolated node reboots, it cannot rejoin the cluster because it has no quorum alone.)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;When a quorum disk is used, it has entirely separate logic for deciding when and where to grant the quorum disk votes.&lt;/P&gt;&lt;P&gt;The quorum disk daemon on each node uses the configured commands ("heuristics") to determine whether the node is in a serviceable condition. If the node is in good condition, it updates its message slot on the shared quorum disk with a timestamp, in effect saying "I'm alive". The quorum disk daemon on the "alive" node with the smallest node ID becomes the "master" quorum disk daemon: this node grants the quorum disk votes. The extra votes take effect in the main quorum voting: this is the only connection between the quorum disk daemon and the main cluster quorum voting logic.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Increasing the number of qdisk votes is not really necessary in a two-node cluster: the optimum number of votes the quorum disk needs in order to achieve a "last-man-standing" configuration is always the number of nodes minus 1. 
In a two-node cluster, this value is 1.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;(Last-man-standing: even if all the other nodes fail, the single remaining node can keep running and maintain quorum alone. If necessary, it can even reboot and restart the cluster services alone, without manual intervention.)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;With the optimum number of quorum disk votes, you can shut down the quorum disk daemons and the cluster will still have quorum: this is useful if you need to reconfigure your quorum disk while the cluster is running. It also makes a quorum disk failure a non-critical event for the cluster.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If you have more than the optimum number of quorum disk votes, you will lose cluster quorum if the quorum disk fails. This can make the quorum disk a single point of failure in your cluster.&lt;/P&gt;</description>
      <pubDate>Sun, 25 Nov 2012 11:15:41 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5877849#M53969</guid>
      <dc:creator>Matti_Kurkela</dc:creator>
      <dc:date>2012-11-25T11:15:41Z</dc:date>
    </item>
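<!--
The majority rule described above can be sketched as a quick shell check. This is a minimal illustration only, not a cluster tool; the vote counts are taken from the cluster.conf in this thread (two 1-vote nodes plus a 3-vote qdisk, expected_votes=5) and the variable names are invented for the example.

```shell
# A partition is quorate when its votes are a strict majority of
# expected_votes, i.e. votes * 2 > expected_votes.
expected_votes=5    # 2 node votes + 3 quorum-disk votes
partition_votes=4   # e.g. one surviving node (1) that holds the qdisk (3)

if [ $((partition_votes * 2)) -gt "$expected_votes" ]; then
  echo "quorate"
else
  echo "inquorate"
fi
```

With these numbers a lone node without the qdisk (1 vote, not more than 2.5) stays inquorate, while a node holding the qdisk (4 votes) keeps quorum: the "last-man-standing" case described in the reply above.
-->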
    <item>
      <title>Re: redhat 6.2 cluster</title>
      <link>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5877915#M53972</link>
      <description>Now I feel more confident. Regarding the fencing: I will use Brocade fencing. Should I connect the Brocade to the cluster network or to the server's data network?</description>
      <pubDate>Sun, 25 Nov 2012 16:49:45 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5877915#M53972</guid>
      <dc:creator>karmellove</dc:creator>
      <dc:date>2012-11-25T16:49:45Z</dc:date>
    </item>
    <item>
      <title>Re: redhat 6.2 cluster</title>
      <link>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5877925#M53973</link>
      <description>&lt;P&gt;You'll need fencing when your heartbeat fails, so putting fencing on the same network as heartbeat might not be a very good idea.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Basically do whatever is appropriate in your set-up to make sure that fencing and heartbeat connections don't fail together.&lt;/P&gt;</description>
      <pubDate>Sun, 25 Nov 2012 17:40:40 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5877925#M53973</guid>
      <dc:creator>Matti_Kurkela</dc:creator>
      <dc:date>2012-11-25T17:40:40Z</dc:date>
    </item>
    <item>
      <title>Re: redhat 6.2 cluster</title>
      <link>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5879921#M53976</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have configured the cluster and it is working perfectly, except for the fencing.&lt;/P&gt;&lt;P&gt;I am using an EMC DS_300B, Fabric OS (FabOS) version 6.3.2b.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;When I run fence_brocade -a IP -l user -p password -o enable/disable manually, it returns successful, but it doesn't really work when one cluster node is fencing the other.&lt;/P&gt;&lt;P&gt;fence_tool ls shows it is still fencing, and I have to run fence_ack_manual to move the services,&lt;/P&gt;&lt;P&gt;then fence_node -U.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Do I have to do something for that?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Tue, 27 Nov 2012 03:32:58 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5879921#M53976</guid>
      <dc:creator>karmellove</dc:creator>
      <dc:date>2012-11-27T03:32:58Z</dc:date>
    </item>
    <item>
      <title>Re: redhat 6.2 cluster</title>
      <link>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5879927#M53977</link>
      <description># fence_brocade -a 192.168.201.167 -l admin -p password -n 2 -o enable&lt;BR /&gt;success: portenable 2&lt;BR /&gt;# fence_brocade -a 192.168.201.168 -l admin -p password -n 2 -o enable&lt;BR /&gt;success: portenable 2&lt;BR /&gt;&lt;BR /&gt;#fence_ack_manual node1&lt;BR /&gt;About to override fencing for node1.&lt;BR /&gt;Improper use of this command can cause severe file system damage.&lt;BR /&gt;&lt;BR /&gt;Continue [NO/absolutely]? absolutely&lt;BR /&gt;Done&lt;BR /&gt;&lt;BR /&gt;#grep brocade /etc/cluster/cluster.conf&lt;BR /&gt;&amp;lt;fencedevice agent="fence_brocade" ipaddr="192.168.201.167" login="admin" name="Brocade1" passwd="password"/&amp;gt;&lt;BR /&gt;&amp;lt;fencedevice agent="fence_brocade" ipaddr="192.168.201.168" login="admin" name="Brocade2" passwd="password"/&amp;gt;&lt;BR /&gt;Thanks</description>
      <pubDate>Tue, 27 Nov 2012 03:39:14 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5879927#M53977</guid>
      <dc:creator>karmellove</dc:creator>
      <dc:date>2012-11-27T03:39:14Z</dc:date>
    </item>
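<!--
One thing worth checking when manual fence_brocade works but cluster-initiated fencing stalls: the grep output above shows fencedevice definitions, but those only work if each clusternode also references the device in a fence method carrying that node's switch port. The fragment below is a hedged sketch of RHEL 6 cluster.conf syntax only; the port numbers and method name are hypothetical and must match your own fabric, and fabric fencing normally also wants an unfence section so the port is re-enabled on node startup.

```
<clusternode name="node1" nodeid="1">
  <fence>
    <method name="1">
      <device name="Brocade1" port="2"/>
    </method>
  </fence>
  <unfence>
    <device name="Brocade1" port="2" action="on"/>
  </unfence>
</clusternode>
```

After editing, bumping config_version and validating with ccs_config_validate before propagating the change is the usual precaution.
-->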
    <item>
      <title>Re: redhat 6.2 cluster</title>
      <link>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5881083#M53983</link>
      <description>&lt;P&gt;You should read these if you use storage-based fencing like fence_brocade:&lt;/P&gt;&lt;P&gt;&lt;A href="https://access.redhat.com/knowledge/solutions/220323" target="_blank"&gt;https://access.redhat.com/knowledge/solutions/220323&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;A href="https://access.redhat.com/knowledge/solutions/236483" target="_blank"&gt;https://access.redhat.com/knowledge/solutions/236483&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In a nutshell: while storage-based fencing certainly stops the fenced node from writing to the disks, the fenced node won't know about it immediately. Only after the fenced node makes an I/O request and sees it time out (after multiple retries) does the node become aware that it has been fenced, and can stop its network activity. Depending on the multipath timeout and queue_if_no_path settings, the I/O requests may take a long time to time out, or with queue_if_no_path they might never time out at all. You may need to tune the I/O request timeouts to be compatible with the cluster timeouts.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Another issue is that when a node is fenced with a storage-based method, the fenced node will quite likely contain data in its buffers that has not yet been written to the shared disks, which are now inaccessible because of the fencing. Normally, the Linux kernel tries very hard not to lose data, so even "reboot -f" will attempt to write out the buffers... so in the case of SCSI-based fencing, the system will fail to reboot.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In this specific situation, you actually *want* to lose the contents of the buffers, since the data may already have been made obsolete by the other node taking over the service. 
You should never unfence a cluster node that has been fenced by a storage-based method without first making sure the node no longer has any data destined for the shared disks in its buffers. A hard reboot by power-cycling is the surest way to do this.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Nov 2012 21:21:41 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-linux/redhat-6-2-cluster/m-p/5881083#M53983</guid>
      <dc:creator>Matti_Kurkela</dc:creator>
      <dc:date>2012-11-27T21:21:41Z</dc:date>
    </item>
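<!--
The multipath timeout interaction described above can be made concrete with a device-mapper-multipath configuration fragment. This is an illustrative sketch, not a recommendation for specific values; the retry count shown is hypothetical and the effective timeout also depends on the path checker polling interval, so it must be tuned against your cluster's own fencing and heartbeat timeouts.

```
defaults {
    # Fail I/O after a bounded number of retries instead of queueing it
    # forever. A fenced node then sees its writes fail and can realize it
    # has been fenced; with queue_if_no_path the I/O may never time out.
    no_path_retry 5
}
```

The value "fail" (retry zero times) is the most aggressive choice; a small number like the one above trades a short delay for tolerance of brief path flaps.
-->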
  </channel>
</rss>

