Operating System - HP-UX

S.N.S
Valued Contributor

node2 down; cluster lock not activated; DLPI error

Hi Folks,

Some good advice needed.
Node 2 of the cluster (all nodes are 11.23 IA, HP SG 11.18) is down.

When I try to run:
vgdisplay -v /dev/vg_lk
vgdisplay: Volume group not activated.
vgdisplay: Cannot display volume group "/dev/vg_lk".
vg_lk is the lock disk. I also get this DLPI error:
DLPI error ack for primitive 11 with 8 0

Can you good people guide me?

Merci/Danke
SNS
"Genius is 1% inspiration, 99% Perspiration" - Edison
rariasn
Honored Contributor

Re: node2 down; cluster lock not activated; DLPI error

Hi,

# vgchange -a e /dev/vg_lk     # activate the VG in exclusive mode on this node

# vgdisplay -v /dev/vg_lk      # the display should now work


rgs,
Rita C Workman
Honored Contributor

Re: node2 down; cluster lock not activated; DLPI error

How many nodes does your cluster have?

You say node2 is down....but is your cluster down?

If it is only a single node in a multi-node cluster, then that node may have an issue with seeing the lock disk. Remember: only one node gets the lock disk, but all need to be able to see it in the event of a failover. Whichever node gets it first becomes the owner of that disk (i.e., gets exclusive rights to it).

Rita
rariasn
Honored Contributor

Re: node2 down; cluster lock not activated; DLPI error

S.N.S
Valued Contributor

Re: node2 down; cluster lock not activated; DLPI error

Thank you all for the very swift replies.

Rita, the cluster is up, running on a single node as of now; only node 2 is down.


CLUSTER        STATUS
scocl          up

  NODE         STATUS       STATE
  sco1         up           running

    PACKAGE    STATUS       STATE        AUTO_RUN    NODE
    pkg1       up           running      enabled     sco1

  NODE         STATUS       STATE
  sco2         down         unknown


However, both node 1 & node 2 show the same message:
vgdisplay: Volume group not activated.
But on sco1, the syslog doesn't show any issue. Rather:
Feb 22 11:33:50 sco1 cmclconfd[13377]: Querying volume group /dev/vg_lk for node sco1
Feb 22 11:33:50 sco1 cmclconfd[16801]: Querying volume group /dev/vg_lk for node sco1
Feb 22 11:33:50 sco1 cmclconfd[16801]: Volume group /dev/vg_lk is configured exclusive
Mar 11 14:50:48 sco1 LVM[10725]: /usr/sbin/vgexport -s -p -m /etc/lvmconf/vg_lk.mapfile /dev/vg_lk

Even if the lock disk is held by a single node (here sco1 is the primary), shouldn't vgdisplay still work, since it is a shared disk? Am I right?

And is the DLPI error in any way related to this?


Can you good people throw some light on this?


Good that HP has ITRC; the GSCs & GCCs would have less traffic :-)...

Danke/Merci,
SNS
"Genius is 1% inspiration, 99% Perspiration" - Edison
Rita C Workman
Honored Contributor

Re: node2 down; cluster lock not activated; DLPI error

Think of playing the musical chairs game. When the music stops, the first person to grab the chair gets to keep it. The lock disk is the chair that every node wants.
Every node must be able to see the lock disk, but only the first node to sit on it gets it! So the lock disk then becomes exclusive to that node: it got the lock disk, and it is the only one sitting on it.

Now, if and when that node goes down, the lock disk is up for grabs again, to the first node that can grab it.

I like to illustrate, so I hope my little tale of the lock disk (musical chairs) helps. In technical terms, the lock disk is what grants quorum so the cluster can form. It is granted to only one node at a time, and strictly on a first-come basis.
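If it helps to see where this lives: the lock disk is defined in the cluster configuration itself, not in any package. A minimal check (just a sketch; the output file name here is only an example):

# cmgetconf /tmp/cl.ascii            # dump the running cluster configuration
# grep CLUSTER_LOCK /tmp/cl.ascii    # shows the FIRST_CLUSTER_LOCK_VG / FIRST_CLUSTER_LOCK_PV entries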

Rita
melvyn burnard
Honored Contributor
Solution

Re: node2 down; cluster lock not activated; DLPI error

Quite simply put: although the cluster lock disk HAS to be in an LVM VG and must be shared, that VG does NOT have to be activated for the cluster lock mechanism to work.
Therefore you could conceivably see the vgdisplay failing.
There are a number of sites I know of that have a small LUN as their cluster lock disk, in a VG, and that VG is NOT part of any package, so it NEVER gets activated.
Check and see if that VG is in any of your packages; a quick check is sketched below.
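One quick way to check (a sketch only, assuming legacy package control scripts under /etc/cmcluster; your layout and file names may differ):

# grep FIRST_CLUSTER_LOCK /etc/cmcluster/clconfig.ascii   # which VG/PV is the lock (file name is an example)
# grep -l vg_lk /etc/cmcluster/*/*.cntl                   # is vg_lk named in any package control script?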

And a DLPI error is normally a networking issue.
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Rita C Workman
Honored Contributor

Re: node2 down; cluster lock not activated; DLPI error

DLPI stands for Data Link Provider Interface.

Could you post a more detailed output of the DLPI message? It looks like we have just one little piece of it in your post; we need the full picture to respond.

Thanks,
Rita
S.N.S
Valued Contributor

Re: node2 down; cluster lock not activated; DLPI error

Thank You, Rita and Melvyn - I will be back after checking the server on Monday!

Appreciate the example, Rita - nice.
And Melvyn, experience speaks volumes.

I think You both need to be assigned more than 7 pts, so I am keeping the assigning on hold till Monday.

And on the DLPI: I think I know the reason, and since it's not connected to this, as per You gurus, let me see if I can fix it on Monday.

Will keep You posted.

Bon Weekend

SNS
"Genius is 1% inspiration, 99% Perspiration" - Edison
S.N.S
Valued Contributor

Re: node2 down; cluster lock not activated; DLPI error

Hi,

Melvyn was right on the dot - the cluster works even when vg_lk isn't activated.

So, will that be the case for larger systems, or does it depend only on whether the lock VG is activated by a package?

As for the DLPI error - the problem started when the LAN card was replaced. It was lan1; now it is lan10. I had changed this in /etc/rc.config.d/netconf, but cmgetconf still says lan1.

This is even after the cluster config file was edited (or maybe the wrong file was edited, since there seem to be multiple files with confusingly similar names).
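To see the mismatch side by side (a sketch; the output file name is only an example):

# lanscan                                     # the actual instance (PPA) numbers after the card swap
# cmgetconf /tmp/running.ascii                # what Serviceguard still has configured
# grep NETWORK_INTERFACE /tmp/running.ascii   # compare these entries against lanscan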
Here are the related errors from the syslog of node2:


cmnetd[4224]: Assertion failed: NULL != element, file: netsen/cmnetd_ip_hpux.c, line: 1350

cmclconfd[2020]: DLPI error ack for primitive 11 with 8 0
cmclconfd[2020]: Unable to attach to network interface 1
cmclconfd[2020]: Unable to attach to DLPI: I/O error

cmcld[2052]: Service cmnetd terminated due to a signal(6).
cmcld[2052]: Utility Daemon cmnetd died unexpectedly! It may be due to a pending reboot or panic
cmcld[2052]: Exiting with status 1.
cmsrvassistd[2072]: Lost connection with Serviceguard cluster daemon (cmcld): Software caused connection abort
cmclconfd[1980]: The cluster daemon aborted our connection (231).
cmclconfd[2026]: The Serviceguard daemon, cmcld[2052], exited with a status of 1.

Details of syslog attached..

Merci
SNS
"Genius is 1% inspiration, 99% Perspiration" - Edison
S.N.S
Valued Contributor

Re: node2 down; cluster lock not activated; DLPI error

Hi All,

I may need some more help here...

When I tried to run cmapplyconf on the clusterconfig.ascii file:

Detected a partition of IPv4 subnet 192.168.220.0.
Partition 1
sco1 lan1
Partition 2
sco2 lan10
Failed to evaluate network
cmapplyconf: Unable to reconcile configuration file socbencl.ascii
with discovered configuration information

OK, this has to be the networking DLPI mismatch.

Fine, the 192.168.220.0 subnet is only the secondary LAN/HB network. Why should node2 (sco2) not be in the cluster when the primary network is up and running, and the only issue is with the secondary 192.168.220.0 subnet?
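One way to see exactly what Serviceguard discovers on each node (a sketch; the file name is only an example):

# cmquerycl -v -n sco1 -n sco2 -C /tmp/probe.ascii          # probe both nodes, write a fresh ascii file
# grep -e NETWORK_INTERFACE -e HEARTBEAT /tmp/probe.ascii   # see how each node's interfaces were discovered

If sco1 reports lan1 and sco2 reports lan10 on 192.168.220.0, the applied configuration has to name them the same way before cmapplyconf will accept it.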

Highly appreciate your inputs.

Merci,
SNS

"Genius is 1% inspiration, 99% Perspiration" - Edison
Rita C Workman
Honored Contributor

Re: node2 down; cluster lock not activated; DLPI error

As was explained to me when I first started working on clusters: when you are building the cluster, EVERYTHING must be perfect.

SG is very picky when you're doing the build. So, if it's complaining about some HB, then take a look at that. You seem to have a good handle on the network work.

Now for:
"...Melvyn was right on the dot - the cluster works even when the vg_lk isnt activated. So, will that be the case for larger system; or will it only depend if the lock disk is activated by the package? "

>>> When your cluster grows, you might consider switching from a lock disk to a quorum server. I find them much easier, with fewer problems. Check it out:

http://docs.hp.com/en/B8467-90048/ch01s03.html

Kindest regards,
Rita
S.N.S
Valued Contributor

Re: node2 down; cluster lock not activated; DLPI error

Hi all,

Thank you, Rita; your posts are good to read - technically grounded, but always very well phrased.

Very true, SG is very picky;
some history for you folks:

I wasn't there when the cluster was set up, and (un)fortunately I came into a situation where I was told that the cluster is running on node 1 (the primary) only, and node2 is down due to a LAN card failure. Now, in a properly designed cluster, no single LAN card failure should make node2 inaccessible to the cluster itself, especially when the primary LAN is up on node2.
And the admins here say that the cluster never worked.
Now, the question - allow me to rephrase: how can the cluster be dependent on the secondary LAN/HB of any node?
The primary LANs of both nodes are up and running.

lan2  - primary LAN on both nodes
lan1  - LAN/HB subnet on node 1
lan10 - LAN/HB subnet on node 2
lan3  - dedicated HB network for both nodes
lan4  - standby LAN

Am I missing something fundamental here, folks?

Highly Appreciate your time & inputs,
Thank You all very much,
SNS
"Genius is 1% inspiration, 99% Perspiration" - Edison
Mike Chisholm
Advisor

Re: node2 down; cluster lock not activated; DLPI error

Serviceguard expects ALL LANs to be working at the time the cluster starts. This is by design. It can tolerate failure of all but one of the HB LANs after the cluster is up, but at cluster start time (or node join time, in this case) it requires all LANs to be working.

It would appear that the LAN card was replaced incorrectly for an HA configuration, since the instance number changed. This has confused Serviceguard. You will need to edit the cluster ascii file and change the instance numbers to reflect the actual current configuration. You may need to halt the cluster to make this change, as I am not sure whether cmapplyconf will let you do this online.

Proper procedures for replacing hardware in a cluster are documented in the SG manual, in Chapter 8; see http://docs.hp.com/en/ha.html#Serviceguard . Note that SG continues to expand a bit what you are allowed to do online vs. offline, but since this is already confused, you should really be prepared to halt the cluster to fix it. Make sure you have enough of a time window to absorb some unexpected problems, because if something goes weird, you may not be able to start the cluster at all on either node until you get it sorted out.
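Roughly, the offline sequence would look like this (a sketch only; file names are examples, and pkg1 is the package from this thread):

# cmhaltpkg pkg1                  # halt the package(s) first
# cmhaltcl -f                     # halt the cluster on all nodes
  (edit the cluster ascii file so the NETWORK_INTERFACE entries match lanscan on each node)
# cmcheckconf -C clconfig.ascii   # verify the corrected file
# cmapplyconf -C clconfig.ascii   # apply the configuration
# cmruncl -v                      # restart the cluster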
Michael Steele_2
Honored Contributor

Re: node2 down; cluster lock not activated; DLPI error

This exact problem has been documented by HP, with a bug report and a patch for MC/SG 11.18 on 11.23.

http://www13.itrc.hp.com/service/patch/patchDetail.do?patchid=PHSS_40363&sel={hpux:11.23,}&BC=main|search|

Search on cmnetd under patch keywords for 11.23.
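For reference, installing such a patch is a normal swinstall run (a sketch; the depot path is only an example, and check the patch notes for any reboot requirement):

# swinstall -s /tmp/PHSS_40363.depot PHSS_40363   # install the patch from a local depot
# swlist -l product PHSS_40363                    # verify it is installed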
Support Fatherhood - Stop Family Law
S.N.S
Valued Contributor

Re: node2 down; cluster lock not activated; DLPI error

Thank you Mike & Michael for your replies.

I have heard you - very good presentation.
I was in an HP GSC myself some time back.


I had initially looked at node2 only. The linkloop test (cmscancl makes it easier) showed that on node 1, lan1 - which makes up the local LAN/HB network - cannot even see itself.
Now this adds to the HP SG network woes!

linkloop -i 1 0x001A4B06F293
Link connectivity to LAN station: 0x001A4B06F293
error: expected primitive 0x30, got DL_ERROR_ACK
dl_error_primitive = 0x2d
dl_errno = 0x04
dl_unix_errno = 57
error - did not receive data part of message
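(For anyone reproducing this: the PPA and the station address for a local self-test come from lanscan. A sketch using the values above:)

# lanscan                         # station address in column 2, PPA under 'Crd In#'
# linkloop -i 1 0x001A4B06F293    # loop lan1 back to its own MAC; a healthy card reports OK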

I need to fix lan1 of node 1 first - yes, I have the answer for the error 57;
thereafter cmquerycl, and cmcheckconf the ascii file from the query.

So my question would be: why is it that, when lan1 was down at the linkloop level, HP SG still started, throwing no errors in the syslog of node1? It clearly shows the needed bridged net:

Apr 2 17:14:28 sco1 cmcld[11370]: lan1 0x001a4b06f293 192.168.220.1 bridged net:1


Meaning SG can start the cluster (using cmrunnode on node1 only) even if lan1 is down? [lan2 is the primary public LAN]

I would like to confirm this, please.

It is a networking issue for sure now.
The cluster lock had nothing to do with it.

The DLPI error should be 99% solved once lan1 is OK.

I am not closing this thread as of now: I would like my HP folks to know how it was finally resolved.

The solution is very near.

And you people are great - and I thought that only the Linux folks were the coolest!

Cheers,
SNS



"Genius is 1% inspiration, 99% Perspiration" - Edison
Michael Steele_2
Honored Contributor

Re: node2 down; cluster lock not activated; DLPI error

Quoting SNS: "Thank you Mike & Michael for your replies."


a) linkloop is a layer-two test of the physical MAC address. If you have no physical acknowledgment, then you have no physical connection. Get it? Check your cable. Swap your cable. Verify the NICs are up by pinging their IP addresses.
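A minimal version of that checklist in commands (a sketch; the peer IP is hypothetical):

# lanscan                    # is the card claimed, and at which PPA?
# ioscan -fnC lan            # hardware state of the LAN interfaces
# netstat -in                # are the right IPs up on the right interfaces?
# ping 192.168.220.2         # e.g., a peer address on the HB subnet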

b) The patches provided were for corrections to the exchange of 4-byte (IPv4) and 16-byte (IPv6) IP addresses, correct?

b1) Have you installed the patches?
b2) Do you have IPv6 turned on somewhere accidentally?
Support Fatherhood - Stop Family Law
S.N.S
Valued Contributor

Re: node2 down; cluster lock not activated; DLPI error

Issue resolved:

Here is how -

The LAN/HB network wasn't present, and so HP SG didn't see node2.
The initial configuration had lan1 of both nodes on this LAN/HB subnet.
On node 2 there was a LAN card failure, and when the card was replaced it wasn't put back in the same position as the faulty card.
But here is the fun part - all along it was thought the issue was due to a LAN or some other failure on node 2...
But...
On further analysis of the cmscancl output, I saw that lan1 of node1 - yes, node 1 - doesn't see itself: linkloop fails with the same error 57.
Now, lan5 on node1 and lan0 on node2 can linkloop with themselves (locally). So, instead of setting lan1 right on both nodes:

1. I configured the LAN/HB network (subnet) with lan5 and lan0.
   Then I halted the package (cmhaltpkg) and then the cluster (cmhaltcl).
2. Got a new cluster config file: cmquerycl -C XYZ -n node1 -n node2
3. Edited the config file to add the lock disk VG and PV entries, and the shared VGs.
   [I had compared it with the running config file obtained using cmgetconf.]
4. cmcheckconf - it succeeded.
5. cmapplyconf -C XYZ - it succeeded.
6. Restarted the cluster.
And it worked!!! (The whole sequence is sketched below.)
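Consolidated, the recovery sequence above looks like this (a sketch; XYZ is the config file name from step 2):

# cmhaltpkg pkg1                        # halt the package
# cmhaltcl -f                           # halt the cluster
# cmquerycl -C XYZ -n node1 -n node2    # discover a fresh configuration
  (edit XYZ: add the lock VG/PV entries and the shared VGs)
# cmcheckconf -C XYZ                    # verify
# cmapplyconf -C XYZ                    # apply
# cmruncl                               # restart the cluster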
Thanks to all you out there...
GOD bless Us all
SNS
"Genius is 1% inspiration, 99% Perspiration" - Edison
S.N.S
Valued Contributor

Re: node2 down; cluster lock not activated; DLPI error

Closing the thread folks!

Cheers!
SNS
"Genius is 1% inspiration, 99% Perspiration" - Edison