Operating System - Linux


 
senthil_kumar_1
Super Advisor

Redhat cluster is not working properly

 

I configured a two-node cluster (without a quorum disk) for the httpd and vsftpd services: httpd should always run on node1 and ftp should always run on node2. If node1 fails, node2 should host both http and ftp; if node2 fails, node1 should host both.

 

 

Hardware details of nodes:

 

Both nodes are “ProLiant BL460c G7”

 

I have configured direct iLO login on both nodes.

 

I used the following method to configure the cluster:

 

1) Installed the OS and assigned an IP address and hostname to each node:

 

 

Node names:

 

Node1: emdlagpbw01 (10.250.1.97)

 

Node2: emdlagpbw02 (10.250.1.98)

 

Each node can ping the other by hostname and by IP address.
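
Note: since cluster.conf refers to the nodes by their fully qualified names, it is worth confirming those FQDNs resolve on both nodes as well. A minimal check, using the FQDNs that appear in the cluster.conf below (adjust to your own DNS or /etc/hosts setup):

# getent hosts emdlagpbw01.emdna.emdiesels.com
# getent hosts emdlagpbw02.emdna.emdiesels.com
# ping -c 2 emdlagpbw02.emdna.emdiesels.com

Both lookups should return the 10.250.1.x addresses listed above.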

 

2)Configured the cluster through "system-config-cluster" on node1.

 

My cluster configuration:

 

# more /etc/cluster/cluster.conf
<?xml version="1.0" ?>
<cluster alias="clu" config_version="14" name="clu">
        <fence_daemon post_fail_delay="0" post_join_delay="20"/>
        <clusternodes>
                <clusternode name="emdlagpbw01.emdna.emdiesels.com" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="EMDLAGPBW01R"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="emdlagpbw02.emdna.emdiesels.com" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="EMDLAGPBW02R"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_ilo" hostname="emdlagpbw01R" login="xxx" name="EMDLAGPBW01R" passwd="xxxxx"/>
                <fencedevice agent="fence_ilo" hostname="emdlagpbw02R" login="xxx" name="EMDLAGPBW02R" passwd="xxxxx"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="EMDLAGPBWCL1" ordered="1" restricted="1">
                                <failoverdomainnode name="emdlagpbw01.emdna.emdiesels.com" priority="1"/>
                                <failoverdomainnode name="emdlagpbw02.emdna.emdiesels.com" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="EMDLAGPBWCL2" ordered="1" restricted="1">
                                <failoverdomainnode name="emdlagpbw01.emdna.emdiesels.com" priority="2"/>
                                <failoverdomainnode name="emdlagpbw02.emdna.emdiesels.com" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="10.250.1.107/22" monitor_link="1"/>
                        <script file="/etc/init.d/httpd" name="httpd"/>
                        <ip address="10.250.1.108/22" monitor_link="1"/>
                        <script file="/etc/init.d/vsftpd" name="vsftpd"/>
                </resources>
                <service autostart="1" domain="EMDLAGPBWCL1" name="httpd">
                        <ip ref="10.250.1.107/22"/>
                        <script ref="httpd"/>
                </service>
                <service autostart="1" domain="EMDLAGPBWCL2" name="vsftpd">
                        <ip ref="10.250.1.108/22"/>
                        <script ref="vsftpd"/>
                </service>
        </rm>
</cluster>

 

3) Copied the file /etc/cluster/cluster.conf to node2.
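
For reference, a quick way to do the copy and confirm both nodes really have the identical file (the hostname is the one given above; the checksum is just a sanity check):

# scp /etc/cluster/cluster.conf emdlagpbw02:/etc/cluster/cluster.conf
# md5sum /etc/cluster/cluster.conf

Run the md5sum on both nodes; the sums should match.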

 

4) Started the "cman" and "rgmanager" services on node1 and node2.

 

But it is taking a long time to start fencing on both nodes...

 

I am seeing the following messages in the log file "/var/log/messages" on both nodes:

 

Sep 13 10:44:10 emdlagpbw02 fenced[32371]: agent "fence_ilo" reports: Unable to connect/login to fencing device
Sep 13 10:44:10 emdlagpbw02 fenced[32371]: fence "emdlagpbw01.emdna.emdiesels.com" failed
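
When fenced reports "Unable to connect/login to fencing device", running the fence agent by hand with the same parameters usually narrows it down. A sketch using the hostname/login values from the cluster.conf above (the options are the common fence-agent ones; see fence_ilo -h on your release):

# fence_ilo -a emdlagpbw01R -l xxx -p xxxxx -o status

If this also fails, the problem is the iLO address, the credentials, or the agent/iLO firmware combination rather than the rest of the cluster configuration.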

 

 

My Questions:

 

1) Is my cluster configuration correct?

 

2)Where is the issue?

 

3) Why is fencing not starting, and how do I resolve this?

 

 

4) Why are the httpd and vsftpd services not started on node1 and node2 respectively?

 

 

 

 

senthil_kumar_1
Super Advisor

Re: Redhat cluster is not working properly

Could anyone please help me with this?
Jimmy Vance
HPE Pro

Re: Redhat cluster is not working properly

Knowing what model of servers you're working with would help. If your servers have iLO3, you need to use fence_ipmilan as the fencing agent, and the fence user you create in iLO will need to have administrator privileges.

 

A patch for fence_ilo is being worked on to enable iLO3 support
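
The agent can also be tested outside the cluster first. A minimal sketch, assuming the iLO's IP address and an IPMI-over-LAN user with administrator rights (option names per fence_ipmilan -h; -P selects lanplus):

# fence_ipmilan -a <iLO IP> -l <user> -p <password> -P -o status

If ipmitool is installed, the equivalent low-level check is:

# ipmitool -I lanplus -H <iLO IP> -U <user> -P <password> chassis power status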

 

No support by private messages. Please ask the forum! 
Chhaya_Z
Valued Contributor

Re: Redhat cluster is not working properly

Hi Senthil,

 

Try the Red Hat KB article below; it might be helpful:

 

https://access.redhat.com/kb/docs/DOC-57676

 

Regards,

Chhaya


I am an HP employee.
senthil_kumar_1
Super Advisor

Re: Redhat cluster is not working properly

Dear All,

I changed the fence device configuration as per the following link: "https://access.redhat.com/kb/docs/DOC-56880".


Right now the cluster.conf file on both nodes is:

# vi /etc/cluster/cluster.conf

<?xml version="1.0"?>
<cluster alias="clu" config_version="14" name="clu">
        <fence_daemon post_fail_delay="0" post_join_delay="20"/>
        <clusternodes>
                <clusternode name="emdlagpbw01.emdna.emdiesels.com" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="EMDLAGPBW01R"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="emdlagpbw02.emdna.emdiesels.com" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="EMDLAGPBW02R"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_ipmilan" power_wait="10" lanplus="1" ipaddr="10.254.1.113" login="xxx" name="EMDLAGPBW01R" passwd="xxxxxx"/>
                <fencedevice agent="fence_ipmilan" power_wait="10" lanplus="1" ipaddr="10.254.1.143" login="xxx" name="EMDLAGPBW02R" passwd="xxxxxx"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="EMDLAGPBWCL1" ordered="1" restricted="1">
                                <failoverdomainnode name="emdlagpbw01.emdna.emdiesels.com" priority="1"/>
                                <failoverdomainnode name="emdlagpbw02.emdna.emdiesels.com" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="EMDLAGPBWCL2" ordered="1" restricted="1">
                                <failoverdomainnode name="emdlagpbw01.emdna.emdiesels.com" priority="2"/>
                                <failoverdomainnode name="emdlagpbw02.emdna.emdiesels.com" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="10.250.1.107/22" monitor_link="1"/>
                        <script file="/etc/init.d/httpd" name="httpd"/>
                        <ip address="10.250.1.108/22" monitor_link="1"/>
                        <script file="/etc/init.d/vsftpd" name="vsftpd"/>
                </resources>
                <service autostart="1" domain="EMDLAGPBWCL1" name="httpd" recovery="relocate">
                        <ip ref="10.250.1.107/22"/>
                        <script ref="httpd"/>
                </service>
                <service autostart="1" domain="EMDLAGPBWCL2" name="vsftpd" recovery="relocate">
                        <ip ref="10.250.1.107/22"/>
                        <script ref="httpd"/>
                </service>
                <service autostart="1" domain="EMDLAGPBWCL2" name="vsftpd" recovery="relocate">
                        <ip ref="10.250.1.108/22"/>
                        <script ref="vsftpd"/>
                </service>
        </rm>
</cluster>


With the same cluster.conf file on both nodes, I started the "cman" service on node1 ("emdlagpbw01.emdna.emdiesels.com")...

Example:

# service cman start
Starting cluster:
Loading modules... done
Mounting configfs... done
Starting ccsd... done
Starting cman... done
Starting daemons... done
Starting fencing... done
[ OK ]

Output from the log file "/var/log/messages" on node1:

Sep 19 11:23:13 emdlagpbw01 openais[11315]: [CLM ] got nodejoin message 10.250.1.97
Sep 19 11:23:14 emdlagpbw01 ccsd[11306]: Initial status:: Quorate
Sep 19 11:24:21 emdlagpbw01 fenced[11335]: emdlagpbw02.emdna.emdiesels.com not a cluster member after 20 sec post_join_delay
Sep 19 11:24:21 emdlagpbw01 fenced[11335]: fencing node "emdlagpbw02.emdna.emdiesels.com"
Sep 19 11:24:37 emdlagpbw01 fenced[11335]: fence "emdlagpbw02.emdna.emdiesels.com" success


Now node2 ("emdlagpbw02.emdna.emdiesels.com") has been fenced (rebooted automatically)...


Once node2 came up, I started the "cman" service on node2, and now node1 has been fenced...


Output from the log file "/var/log/messages" on node2:

Sep 19 11:31:14 emdlagpbw02 ccsd[7559]: Initial status:: Quorate
Sep 19 11:32:21 emdlagpbw02 fenced[7587]: emdlagpbw01.emdna.emdiesels.com not a cluster member after 20 sec post_join_delay
Sep 19 11:32:21 emdlagpbw02 fenced[7587]: fencing node "emdlagpbw01.emdna.emdiesels.com"
Sep 19 11:32:37 emdlagpbw02 fenced[7587]: fence "emdlagpbw01.emdna.emdiesels.com" success
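
For reference, checking membership on each node right after cman starts shows whether the two nodes ever see each other before the 20-second post_join_delay expires (standard cman tools):

# cman_tool nodes
# cman_tool status

If each node only ever lists itself as a member, the interconnect between the nodes is the thing to investigate.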


How can I solve this issue? Please help me...

Thanks a lot in advance..
Matti_Kurkela
Honored Contributor

Re: Redhat cluster is not working properly

Apparently your cluster nodes aren't receiving each other's multicast traffic.

 

The steps listed in

https://access.redhat.com/kb/docs/DOC-57237

could be useful in confirming the problem. (Note: when you ping the cluster multicast address with a two-node cluster, you should get two responses for each outgoing ping message. If you get just one, multicast is not working in your network.)
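
A quick way to find and ping that address, assuming the RHEL 5 cman tools (the exact output format of cman_tool status may vary slightly between releases):

# cman_tool status | grep -i multicast
# ping -c 5 <multicast address reported above>

With both nodes up you should see two replies per ping (the second marked DUP!); a single reply per ping means only the local node is answering.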

 

This is a known issue with Cisco switches (and possibly other switches with a similar IGMP snooping implementation). Please see this document for the root cause analysis and links to Cisco documents with suggested solutions:

https://access.redhat.com/kb/docs/DOC-57238

 

Direct link to the most relevant Cisco document:

http://www.cisco.com/en/US/products/hw/switches/ps708/products_tech_note09186a008059a9df.shtml

 

In a nutshell: in Cisco switches, the IGMP snooping feature is enabled by default, but it will only work correctly if there is a multicast-enabled router (mrouter) or some other source of IGMP queries in the same network segment/VLAN. If you don't need full multicast routing functionality (i.e. if you only need multicast to work within a particular network segment/VLAN), you can use an IGMP Querier function that is built into some Cisco switches. The IGMP Querier will send IGMP queries just like an mrouter, but won't actually route any multicast packets at all. But the queries, and the nodes' responses to them, will allow the IGMP snooping feature of the switches to work as designed.
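
To confirm this from the Linux side before involving the network team, watching for IGMP and cluster multicast traffic on each node's cluster interface is usually enough. A sketch, assuming eth0 is the cluster interface and that cman is using its default 239.192.0.0/16 multicast range (check cman_tool status for the actual address):

# tcpdump -n -i eth0 igmp
# tcpdump -n -i eth0 net 239.192.0.0/16

If one node never sees the other node's multicast packets or IGMP membership reports, the switch is dropping them, which matches the IGMP snooping behaviour described above.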

MK
Arunabha Banerjee
Valued Contributor

Re: Redhat cluster is not working properly

I think there is some problem in your cluster.conf file. I can't understand why you have mentioned "failoverdomains" two times.

 

<failoverdomains>                                                                        
        <failoverdomain name="EMDLAGPBWCL1" ordered="1" restricted="1">                  
                <failoverdomainnode name="emdlagpbw01.emdna.emdiesels.com" priority="1"/>
                <failoverdomainnode name="emdlagpbw02.emdna.emdiesels.com" priority="2"/>
        </failoverdomain>                                                                
        <failoverdomain name="EMDLAGPBWCL2" ordered="1" restricted="1">                  
                <failoverdomainnode name="emdlagpbw01.emdna.emdiesels.com" priority="2"/>
                <failoverdomainnode name="emdlagpbw02.emdna.emdiesels.com" priority="1"/>
        </failoverdomain>                                                                
</failoverdomains>                                                                       

 

Can you please replace that section of the cluster.conf file with the following entry:

 

<failoverdomains>                                                                        
        <failoverdomain name="EMDLAGPBWCL1" ordered="1" restricted="1">                  
                <failoverdomainnode name="emdlagpbw01.emdna.emdiesels.com" priority="1"/>
                <failoverdomainnode name="emdlagpbw02.emdna.emdiesels.com" priority="2"/>
	</failoverdomain>                                                                
</failoverdomains>                                                                       

 

Then stop and start the cman and rgmanager services:

 

[root@emdlagpbw01 ~]# service rgmanager stop 
[root@emdlagpbw01 ~]# service cman stop      
                                             
                                             
[root@emdlagpbw01 ~]# service cman start     
[root@emdlagpbw01 ~]# service rgmanager start
[root@emdlagpbw01 ~]# clustat                

 

 

AB
Matti_Kurkela
Honored Contributor

Re: Redhat cluster is not working properly

> I think there is some problem in your cluster.conf file. I can't understand why you have mentioned "failoverdomains" two times.

 

This is not a problem.

 

Senthil's configuration file looks like he wants to run 3 cluster services, so that the httpd service normally runs on node 1 and the two vsftpd instances on node 2. To make this happen, Senthil needs two failover domain definitions: the first failover domain, EMDLAGPBWCL1, has node 1 as the top priority, while the second domain has the node priorities in the reverse order. Then he assigns the httpd service to failover domain EMDLAGPBWCL1, and the vsftpd services to EMDLAGPBWCL2.

 

Each service has autostart enabled: together with the failover domain definitions, this makes the cluster automatically start each service on the node Senthil prefers, unless there is a problem that requires the service to fail over elsewhere.

 

If Senthil had only one failover domain as you suggest, all the services would normally run on the first node.
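
As a side note, once rgmanager is up, the effect of the two failover domains is easy to verify by hand with clustat and clusvcadm, using the service and node names from Senthil's configuration:

# clustat
# clusvcadm -r httpd -m emdlagpbw02.emdna.emdiesels.com
# clustat

The first clustat should show httpd on node 1 and vsftpd on node 2; the relocate command moves httpd to node 2, and a second relocate moves it back.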

 

This is actually a great philosophical question in failover clustering:

  • do you want to keep the spare node(s) idle, so that you know for sure the response times of your services won't degrade if you have to failover?
  • or do you want to provide the best possible response time you can (by distributing the services over the whole set of nodes) in the normal situation, but are willing to accept some degradation when a failover happens?

With a two-node cluster, the first option is expensive: you have the computing power of one server, but the hardware costs of two. The second option allows you to get more use of your hardware, but you must carefully track your workloads and remember that if one node fails, all the workload will be moved to the remaining node, which may become overloaded. But if you understand and accept that risk, and have a plan to mitigate the risk, that's OK. (Perhaps one of the services is less critical than the others and can be shut down in case of overload? Or perhaps Senthil's boss has some estimates of the expected usage of the cluster services, and figures he can authorize the purchase of more hardware well before the workload will become so heavy it cannot all be run on a single node any more.)

MK
Adam Garsha
Valued Contributor

Re: Redhat cluster is not working properly

I've seen this ping-pong fencing happen when the boxes' time is slightly off. I recommend trying the following:

 

I'd bring each box up to single-user mode (if you have the cluster starting when the box comes up). Then:

 

ifup mynic
ntpdate mytime.server.foo    # using your time server
hwclock --systohc
date; hwclock --show         # verify they are mated
reboot

 

That works for me e.g. if I move a blade and time gets fracked.
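
To keep the clocks from drifting apart again after that one-time sync, running ntpd on both nodes is the usual follow-up; a minimal sketch, assuming a reachable time server is already listed in /etc/ntp.conf:

# chkconfig ntpd on
# service ntpd start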

senthil_kumar_1
Super Advisor

Re: Redhat cluster is not working properly

Hi All,

Thanks a lot for your support in resolving this issue.

Right now my cluster is working fine with the following configuration.

# vi /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="clu" config_version="14" name="clu">
        <fence_daemon post_fail_delay="0" post_join_delay="20"/>
        <clusternodes>
                <clusternode name="emdlagpbw01.emdna.emdiesels.com" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="EMDLAGPBW01R"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="emdlagpbw02.emdna.emdiesels.com" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="EMDLAGPBW02R"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1" broadcast="yes"/>
        <fencedevices>
                <fencedevice agent="fence_ipmilan" power_wait="10" lanplus="1" ipaddr="10.254.1.113" login="tcs" name="EMDLAGPBW01R" passwd="tCs12345"/>
                <fencedevice agent="fence_ipmilan" power_wait="10" lanplus="1" ipaddr="10.254.1.143" login="tcs" name="EMDLAGPBW02R" passwd="tCs12345"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="EMDLAGPBWCL1" ordered="1" restricted="1">
                                <failoverdomainnode name="emdlagpbw01.emdna.emdiesels.com" priority="1"/>
                                <failoverdomainnode name="emdlagpbw02.emdna.emdiesels.com" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="EMDLAGPBWCL2" ordered="1" restricted="1">
                                <failoverdomainnode name="emdlagpbw01.emdna.emdiesels.com" priority="2"/>
                                <failoverdomainnode name="emdlagpbw02.emdna.emdiesels.com" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="10.250.1.107/22" monitor_link="1"/>
                        <script file="/etc/init.d/httpd" name="httpd"/>
                        <ip address="10.250.1.108/22" monitor_link="1"/>
                        <script file="/etc/init.d/vsftpd" name="vsftpd"/>
                </resources>
                <service autostart="1" domain="EMDLAGPBWCL1" name="httpd" recovery="relocate">
                        <ip ref="10.250.1.107/22"/>
                        <script ref="httpd"/>
                </service>
                <service autostart="1" domain="EMDLAGPBWCL2" name="vsftpd" recovery="relocate">
                        <ip ref="10.250.1.108/22"/>
                        <script ref="vsftpd"/>
                </service>
        </rm>
</cluster>


So it is working fine without a file system resource configured...

Now I have configured file system resources as below:

# vi /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="clu" config_version="14" name="clu">
        <fence_daemon post_fail_delay="0" post_join_delay="20"/>
        <clusternodes>
                <clusternode name="emdlagpbw01.emdna.emdiesels.com" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="EMDLAGPBW01R"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="emdlagpbw02.emdna.emdiesels.com" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="EMDLAGPBW02R"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1" broadcast="yes"/>
        <fencedevices>
                <fencedevice agent="fence_ipmilan" power_wait="10" lanplus="1" ipaddr="10.254.1.113" login="tcs" name="EMDLAGPBW01R" passwd="tCs12345"/>
                <fencedevice agent="fence_ipmilan" power_wait="10" lanplus="1" ipaddr="10.254.1.143" login="tcs" name="EMDLAGPBW02R" passwd="tCs12345"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="EMDLAGPBWCL1" ordered="1" restricted="1">
                                <failoverdomainnode name="emdlagpbw01.emdna.emdiesels.com" priority="1"/>
                                <failoverdomainnode name="emdlagpbw02.emdna.emdiesels.com" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="EMDLAGPBWCL2" ordered="1" restricted="1">
                                <failoverdomainnode name="emdlagpbw01.emdna.emdiesels.com" priority="2"/>
                                <failoverdomainnode name="emdlagpbw02.emdna.emdiesels.com" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="10.250.1.107/22" monitor_link="1"/>
                        <script file="/etc/init.d/httpd" name="httpd"/>
                        <fs device="/dev/sda2" force_fsck="0" force_unmount="1" fsid="33611" fstype="ext3" mountpoint="/test_node1_filesystem" name="test_node1" options="" self_fence="0"/>
                        <ip address="10.250.1.108/22" monitor_link="1"/>
                        <script file="/etc/init.d/vsftpd" name="vsftpd"/>
                        <fs device="/dev/sda3" force_fsck="0" force_unmount="1" fsid="54001" fstype="ext3" mountpoint="/test_node2_filesystem" name="test_node2" options="" self_fence="0"/>
                </resources>
                <service autostart="1" domain="EMDLAGPBWCL1" name="httpd" recovery="relocate">
                        <ip ref="10.250.1.107/22"/>
                        <script ref="httpd"/>
                        <fs ref="test_node1"/>
                </service>
                <service autostart="1" domain="EMDLAGPBWCL2" name="vsftpd" recovery="relocate">
                        <ip ref="10.250.1.108/22"/>
                        <script ref="vsftpd"/>
                        <fs ref="test_node2"/>
                </service>
        </rm>
</cluster>

For your information:

The cluster services "httpd" and "vsftpd" started on the two servers, but only the script and IP resources come up, not the file systems...

The file systems are not mounted on either node...

I have assigned the same LUN (/dev/sda) to both nodes, so both nodes can see the partitions "/dev/sda2" and "/dev/sda3"...

And I have created the empty directories "/test_node1_filesystem" and "/test_node2_filesystem" on both nodes.

But I do not see the mount point "/test_node1_filesystem" on node1 or "/test_node2_filesystem" on node2, while the rest of the resources are started: the script "httpd" and IP "10.250.1.107" on node1, and the script "vsftpd" and IP "10.250.1.108" on node2.


How do I troubleshoot this issue?
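
A couple of checks that may help narrow this down. First, try the mount by hand, using the device and mountpoint from the configuration above, to rule out problems with the LUN or partition itself; second, rg_test (shipped with rgmanager) can walk the service start outside the cluster and print which resource fails:

# mount -t ext3 /dev/sda2 /test_node1_filesystem    (on node1; umount it again afterwards)
# rg_test test /etc/cluster/cluster.conf start service httpd

It is also worth checking /var/log/messages for clurgmgrd/fs.sh errors, and making sure config_version in cluster.conf is incremented every time the file changes; all of the versions posted above still say config_version="14".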