
lock disk problem

 
edi_4
Advisor

lock disk problem

Hi - I have a 2-node Linux (SLES10 SP2) Serviceguard cluster. The storage we are using is an EVA, and I have configured the lock disk on the EVA. What happens is: when I reboot node1, node2 also reboots because it cannot get the lock disk. Node1 survives if I reboot node2. The lock LUN is accessible from both nodes...
How can I find out what is going on? Thank you.

Here is the log file:
May 6 15:25:01 opera2 cmcld[25936]: Obtaining Cluster Lock
May 6 15:25:01 opera2 cmdisklockd[25959]: Obtaining cluster lock device /dev/sdi1
May 6 15:25:01 opera2 cmdisklockd[25959]: Unable to obtain the lock!
May 6 15:25:01 opera2 cmcld[25936]: Attempting to form a new cluster
May 6 15:25:01 opera2 cmcld[25936]: Beginning standard election
May 6 15:25:07 opera2 cmcld[25936]: Obtaining Cluster Lock
May 6 15:25:07 opera2 cmdisklockd[25959]: Obtaining cluster lock device /dev/sdi1
May 6 15:25:07 opera2 cmdisklockd[25959]: Unable to obtain the lock!
May 6 15:25:07 opera2 cmcld[25936]: Attempting to form a new cluster
May 6 15:25:07 opera2 cmcld[25936]: Beginning standard election
May 6 15:25:12 opera2 cmcld[25936]: Obtaining Cluster Lock
May 6 15:25:12 opera2 cmdisklockd[25959]: Obtaining cluster lock device /dev/sdi1
May 6 15:25:12 opera2 cmdisklockd[25959]: Unable to obtain the lock!

Matti_Kurkela
Honored Contributor

Re: lock disk problem

In Linux, the mapping between the actual storage LUNs and /dev/sd* devices is not at all guaranteed to stay the same before & after a reboot. You should specify your lock device using a device path that is guaranteed to be persistent across reboots - anything else is asking for trouble.

Is /dev/sdi1 *really* your lock disk device *now*? It certainly was back when you originally set up ServiceGuard, but that does not say anything about the current situation.

Please run "fdisk -l /dev/sdi" to verify that the LUN actually contains the lock partition. If it doesn't, you'll need to change your cluster ASCII file to point to the lock LUN using some persistent device name, and re-apply the cluster configuration.
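
For example (a sketch - the details will of course differ on your system):

fdisk -l /dev/sdi
# the lock LUN should be a small disk containing just the one lock partition,
# e.g. /dev/sdi1 of type 83 (Linux); if you see data or LVM partitions instead,
# the sd* letters have shifted since the cluster was configured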

I'm not very familiar with SLES, but Google tells me SLES10 has dm-multipath just like RHEL 4 and newer. Apparently the name of the necessary package is "multipath-tools". Make sure it is installed.
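
For example:

rpm -q multipath-tools
# if it is not installed, add it from your SLES installation source (YaST or zypper)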

Then run "multipath -v2" to initialize the multipath maps, then "multipath -l" to see the mapping between the multipath device names and the regular /dev/sd* devices. Make sure "multipathd" gets started automatically at boot, and start it now if necessary: it is responsible for updating the multipath mappings automatically.
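
On SLES10 that should look something like the following (the init script names are my assumption - check what you actually have in /etc/init.d):

/etc/init.d/boot.multipath start   # create the multipath maps
/etc/init.d/multipathd start       # start the daemon that keeps them updated
chkconfig boot.multipath on        # make both start automatically at boot
chkconfig multipathd on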

The multipath device name for your lock disk will be something like "/dev/mapper/<name>p1", where <name> will either be the WWID of the multipathed disk (= a string of hex digits) or an "mpathX"-style "friendly name". The "p1" at the end is the partition identifier. SLES seems to default to WWIDs, while Red Hat uses friendly names. If you don't like the default naming style, change it in /etc/multipath.conf.
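
For example, a minimal /etc/multipath.conf snippet that pins a short, fixed alias onto the lock LUN (just a sketch - "lockdisk" is a name I made up, and you have to substitute the WWID that "multipath -l" reports for your LUN):

multipaths {
    multipath {
        wwid  <WWID-of-your-lock-LUN>
        alias lockdisk
    }
}

After the maps are reloaded, the partition shows up under /dev/mapper as "lockdisk" plus whatever partition suffix kpartx uses on your system (e.g. "p1" or "-part1").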

If you don't want to use multipathing for some reason, check the /dev/disk/by-* directories: in Linux distributions with modern udev, these will offer various ways to identify your disk devices in a persistent manner.
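
For example (the exact link names depend on your udev rules, so treat this as a sketch):

ls -l /dev/disk/by-id/ | grep sdi1
# should show something like scsi-<WWID>-part1 -> ../../sdi1
# a /dev/disk/by-id/... path keeps pointing at the same LUN even if the sd* letter changes after a reboot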

MK
Serviceguard for Linux
Honored Contributor

Re: lock disk problem

Matti is right about making sure that your multipathing is correct and that you have the disks set up with persistent names (check the docs and the certification matrix for more info).

Also, you didn't say whether this ever worked or not. Depending on the EVA model, make sure your firmware is up to date. Some of the older EVAs were active/passive devices, which could possibly cause this.
edi_4
Advisor

Re: lock disk problem

Thanks for the reply - Sometimes it works - the lock disk is accessible. The name of the lock disk survives reboot - it is always sdi1. I have already tried to use udev, but the name of the lock disk is too long. I just don't know how to rename it.

The cluster conf file is:

NODE_NAME opera2
NETWORK_INTERFACE bond0
STATIONARY_IP 192.168.99.107
NETWORK_INTERFACE bond1
HEARTBEAT_IP 192.168.240.2
# CLUSTER_LOCK_LUN /dev/sdi1
CLUSTER_LOCK_LUN /dev/mapper/3600508b40006836800017000010e0000-part1

opera2:/edi # cmcheckconf -v -C file1
Begin cluster verification...
Checking cluster file: file1
Value specified for CLUSTER_LOCK_LUN at line 104 is too long. Its length should not exceed 39 charaters
cmcheckconf: Error found in cluster file: file1.
opera2:/edi #



The lock LUN on the EVA is LUN 9:

opera1:/dev/mapper # multipath -l
3600508b40006836800017000010e0000 dm-18 HP,HSV200
[size=2.0G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=-2][active]
\_ 1:0:1:9 sdcf 69:48 [active][undef]
\_ 0:0:1:9 sdah 66:16 [active][undef]
\_ round-robin 0 [prio=-2][enabled]
\_ 1:0:0:9 sdbg 67:160 [active][undef]
\_ 0:0:0:9 sdi 8:128 [active][undef]

opera2:/etc/init.d # multipath -l
3600508b40006836800017000010e0000 dm-18 HP,HSV200
[size=2.0G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=-2][active]
\_ 1:0:1:9 sdcf 69:48 [active][undef]
\_ 0:0:0:9 sdi 8:128 [active][undef]
\_ round-robin 0 [prio=-2][enabled]
\_ 1:0:0:9 sdbg 67:160 [active][undef]
\_ 0:0:1:9 sdah 66:16 [active][undef]

opera1:/dev/mapper # dmsetup ls | grep 10e
3600508b40006836800017000010e0000 (253, 18)
3600508b40006836800017000010e0000-part1 (253, 37)
opera1:/dev/mapper #

opera2:/etc/init.d # dmsetup ls | grep 10e
3600508b40006836800017000010e0000 (253, 18)
3600508b40006836800017000010e0000-part1 (253, 39)
opera2:/etc/init.d #



edi_4
Advisor

Re: lock disk problem

The only way to survive a reboot is to change NODE_IDLE_TIMEOUT to 20 sec. Then the lock disk can be obtained, but the cluster becomes lazy -
it needs 4 minutes to reconfigure...
Perhaps I am missing something - or there is a bug...
Stephen_126
Occasional Advisor

Re: lock disk problem

 

Somebody coded the CLUSTER_LOCK_LUN check wrong within cmcheckconf - it can't take standard mapper names (too long).

So, my LOCK_LUN is /dev/mapper/3600blah-blah0000_part1. I put a rename at the top of the 'start' case within the /etc/init.d/cmcluster script, like this:

 /sbin/dmsetup rename /dev/mapper/3600blah-blah0000 CLUDISK  2>/dev/null

 /sbin/dmsetup rename /dev/mapper/3600blah-blah0000_part1 CLUDISK_part1  2>/dev/null

 

NOTE: you MUST keep the underscore in "_part1"; the stderr redirect is there because the map may already have been renamed on a previous run.

And I used it in the cluster conf file:
CLUSTER_LOCK_LUN       /dev/mapper/CLUDISK_part1

 

This is much easier than tweaking the udev stuff, which is what I first started to look at. You must REALLY do the above so your disk is serviced via multipathing; otherwise your cluster is hostage to just one SAN path to the lock disk.
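
A quick sanity check after the renames (assuming the two dmsetup lines above have run), before re-running cmcheckconf/cmapplyconf:

dmsetup ls | grep CLUDISK            # both CLUDISK and CLUDISK_part1 should be listed
ls -l /dev/mapper/CLUDISK_part1      # the path used for CLUSTER_LOCK_LUN must exist on BOTH nodes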

 

KEYWORDS:

Serviceguard for Linux SAPeSG Service Guard SuSE

Value specified for CLUSTER_LOCK_LUN at line is too long. Its length should not exceed characters