Anyone ever seen this?

 
Craig Johnson_1
Regular Advisor

Anyone ever seen this?

Mixed 11.23/11.31 cluster, 14 nodes (seven of each). Migration to new 11.31 servers. cmapplyconf threw this:

/etc/cmcluster/apply.gen[3]: 10932 Abort(coredump)

I was trying to add new node names to about 10 packages and the apply blew up. I reduced it to just update two packages and it worked (whew!).

I think I ran into some sort of bug/limitation?
26 REPLIES 26
Craig Johnson_1
Regular Advisor

Re: Anyone ever seen this?

I should have noted that cmcheckconf worked fine.
John Bigg
Esteemed Contributor

Re: Anyone ever seen this?

Normally when cmapplyconf aborts there is an abort message. Was this re-directed somewhere? Without this there is not much that can be guessed.

There is a known problem with cmapplyconf aborting, which I have seen several times (although I would expect it to affect cmcheckconf too). It is fixed in PHSS_41902 (SG 11.19) and PHSS_41523 (SG 11.20):

"Assertion failed:
(char *)tmp_vgd + copy_size <= (char*)lim + msg_length ,
file: config/config_lvm.c, line: 733"

but you would need the abort message to check. Looking in the patch catalog there are a few cmapplyconf abort issues so you should probably check for these too.
Viktor Balogh
Honored Contributor

Re: Anyone ever seen this?

Craig,

what version of SG do you have?

# cmversion
****
Unix operates with beer.
Emil Velez
Honored Contributor

Re: Anyone ever seen this?

You should not make changes to the cluster when you are in a mixed mode.

Craig Johnson_1
Regular Advisor

Re: Anyone ever seen this?

A.11.19.00 on both 11.23 and 11.31 nodes.

We have definitely seen that other "assertion failed" error also. Our workaround was to not use the "-k" option to check/apply.
Stephen Doud
Honored Contributor

Re: Anyone ever seen this?

Page 13 of the Release Notes for A.11.19 at http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c02032073/c02032073.pdf
states:
"Support for Mixed-OS Clusters (HP-UX 11i v2 and 11i v3)
With some limitations, HP now supports Serviceguard clusters in which some nodes
are running HP-UX 11i v2 and some 11i v3."

Page 53 states:
"As of Serviceguard A.11.18 a cluster can contain a mix of nodes running HP-UX 11i v2 and 11i v3, with certain restrictions."

Page 55 documents the 'Rules and Restrictions for Heterogeneous Clusters'

None of the restrictions include Serviceguard configuration commands, so cmcheckconf and cmapplyconf are supported in mixed O/S clusters.

If you feel you are seeing a bug, you should open a call with the HP Customer Support Center to investigate this further.
Craig Johnson_1
Regular Advisor

Re: Anyone ever seen this?

This project preceded me. The plan was hatched and verified with HP over a year ago. A couple months back the engineer assigned to the project was given a higher priority project and (guess who?) got this one?

Anyway, as far as this core dump is concerned, two things happened that may have caused it. First was that I tried to update too many packages at once. Secondly I got distracted and it say there waiting for me to answer "y" to the "Modify the cluster configuration?" question for about 10 minutes.

All I know is that reducing the number of package updates allowed me to continue.
Craig Johnson_1
Regular Advisor

Re: Anyone ever seen this?

"sat there" not "say there"
John Bigg
Esteemed Contributor

Re: Anyone ever seen this?

I have to say that I am not aware of any problems associated with a large number of packages, or due to waiting a long time before completing the command. It would be interesting to see any command output (I would expect some) or a stack trace from the core file which would allow us to work out the abort. The abort is almost certainly an assertion. Otherwise I'd expect a SIGSEGV rather than an abort. I think we only ever abort on an assertion.
Craig Johnson_1
Regular Advisor

Re: Anyone ever seen this?

How do I get the stack trace from the core file?
Steven E. Protter
Exalted Contributor

Re: Anyone ever seen this?

Shalom,

As an aside, it is quite complex to have 14 nodes in a cluster.

It invites complications.

Long range, you might want to break it up into more bite-sized clusters. You are pushing the envelope. I'm sure it should work, but it's gonna hurt.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Dennis Handly
Acclaimed Contributor

Re: Anyone ever seen this?

>How do I get the stack trace from the core file?

Make sure you have gdb then:
file core
(figure out path to executable)
gdb executable core
(gdb) bt
(gdb) q

>John: I think we only ever abort on an assertion.

That's what SIGABRT means.
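
If you would rather capture the backtrace to a file than type at the (gdb) prompt, the steps above can be scripted roughly like this. It is only a sketch: `core_backtrace` and the temporary command file are names I made up, and it assumes your gdb accepts `-batch` and `-x` (stock GDB does; check `gdb --help` on HP WDB before relying on it).

```shell
#!/bin/sh
# Sketch: print a backtrace from a core file non-interactively.
# core_backtrace is a hypothetical helper name, not an HP-supplied tool.
core_backtrace() {
    exe=$1
    core=$2
    cmds=${TMPDIR:-/tmp}/bt_cmds.$$
    # Same command you would type at the (gdb) prompt.
    printf 'bt\n' > "$cmds"
    # DRY_RUN=1 only prints the gdb command line instead of running it.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo gdb -batch -x "$cmds" "$exe" "$core"
    else
        gdb -batch -x "$cmds" "$exe" "$core"
    fi
    rm -f "$cmds"
}
```

Usage would be something like `core_backtrace /usr/sbin/cmapplyconf core > bt.txt`, with the executable path taken from `file core` and `which` as described above.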
Craig Johnson_1
Regular Advisor

Re: Anyone ever seen this?

@Stephen: Yes it is painful but only temporary. Due to the sheer size of the original cluster we decided to migrate through mixed mode rather than move packages to a new cluster.
Craig Johnson_1
Regular Advisor

Re: Anyone ever seen this?

$ file core
core: ELF-32 core file - IA64 from 'cmapplyconf' - received SIGABRT

ajohnsce@HP-UX:a300sua6 [ /home/ajohnsce ]
$ which cmapplyconf
/usr/sbin/cmapplyconf

ajohnsce@HP-UX:a300sua6 [ /home/ajohnsce ]
$ gdb /usr/sbin/cmapplyconf core
HP gdb 5.4.0 for HP Itanium (32 or 64 bit) and target HP-UX 11.2x.
Copyright 1986 - 2001 Free Software Foundation, Inc.
Hewlett-Packard Wildebeest 5.4.0 (based on GDB) is covered by the
GNU General Public License. Type "show copying" to see the conditions to
change it and/or distribute copies. Type "show warranty" for warranty/support.
..
Core was generated by `cmapplyconf'.
Program terminated with signal 6, Aborted.

#0 0x60000000c037a1d0:0 in kill+0x30 () from /usr/lib/hpux32/libc.so.1
(gdb) bt
#0 0x60000000c037a1d0:0 in kill+0x30 () from /usr/lib/hpux32/libc.so.1
#1 0x60000000c026fb70:0 in raise+0x30 () from /usr/lib/hpux32/libc.so.1
#2 0x60000000c03330f0:0 in abort+0x190 () from /usr/lib/hpux32/libc.so.1
#3 0x60000000ca970980:0 in cl_cassfail+0x240 ()
from /usr/lib/hpux32/libsgcl.so
#4 0x60000000ca6b6000:0 in cdb_add_applied_version_op_to_trans+0x1e0 ()
from /usr/lib/hpux32/libsgcl.so
#5 0x60000000ca709d80:0 in cf_configure_cluster+0x2de0 ()
from /usr/lib/hpux32/libsgcl.so
#6 0x4009cb0:0 in main () at cmd/cmd_config_apply.c:1029
(gdb) q
Dennis Handly
Acclaimed Contributor

Re: Anyone ever seen this?

>#4 in cdb_add_applied_version_op_to_trans

Now you need a domain expert to tell you what that could mean.
Michael Leu
Honored Contributor

Re: Anyone ever seen this?

I have seen this as well, when you confirm the cmapplyconf too late, it dumps core.
John Bigg
Esteemed Contributor

Re: Anyone ever seen this?

Ok, this is a new one on me and something which needs to be investigated. Please contact HP support and provide them with details of the problem along with the log files (syslog) and core file. It would be helpful to have the abort message from the command itself, since there are several asserts within that function, although we could work out which one it is from the core file.

This is new code added in 11.19.

I'm not sure this is related to a delay between running the cmapplyconf and hitting "y". I left cmapplyconf at the prompt for an hour and it worked without error when I did complete the command.

> >John: I think we only ever abort on an assertion.
>
> Dennis: That's what SIGABRT means.

I was meaning that the only time that cmapplyconf aborts is when we hit an assertion. Therefore if we see a SIGABRT I would expect to see an assert message. The stack trace confirms this is an assert.
Dennis Handly
Acclaimed Contributor

Re: Anyone ever seen this?

>John: It would be helpful to have the abort message from the command itself

Yes, that should have been in the initial thread.

>Therefore if we see a SIGABRT I would expect to see an assert message.

Yes, unless some evil sysadmin does a "kill -SIGABRT" on the process. ;-)
Stephen Doud
Honored Contributor

Re: Anyone ever seen this?

You indicated that reducing the number of packages provided a workaround, and that you think the problem may be due to a delay between running cmapplyconf and answering yes. Perhaps you have the opportunity to re-test. If so, use the -f option with cmapplyconf, which skips the confirmation prompt, and see if you can modify all the packages in one go.
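
If the one-shot retry with -f still aborts, one way to fall back on the "fewer packages per apply" workaround without sitting at the prompt is sketched below. Treat it as an illustration only: `apply_in_batches` and `run_apply` are hypothetical helper names, the package file names are placeholders, and it assumes cmapplyconf accepts a repeated `-P pkg_file` argument as documented in its man page. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch: apply package config files in small batches with cmapplyconf -f.
# Both function names are made up for this example.
# File names must not contain spaces (the argument list is built as a string).
run_apply() {
    # DRY_RUN defaults to 1, so nothing is applied unless you set DRY_RUN=0.
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo cmapplyconf -f "$@"
    else
        cmapplyconf -f "$@"
    fi
}

apply_in_batches() {
    size=$1
    shift
    args=""
    n=0
    for f in "$@"; do
        args="$args -P $f"
        n=$((n + 1))
        if [ "$n" -eq "$size" ]; then
            # Stop at the first failing batch.
            run_apply $args || return 1
            args=""
            n=0
        fi
    done
    # Apply whatever is left over.
    if [ "$n" -gt 0 ]; then
        run_apply $args || return 1
    fi
}
```

For example, `apply_in_batches 2 pkg1.conf pkg2.conf pkg3.conf` would print two cmapplyconf command lines in dry-run mode. Running cmcheckconf on the same files first is still a good idea.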
John Bigg
Esteemed Contributor

Re: Anyone ever seen this?

Strangely enough, we just had this same abort on one of our clusters! And the assert message was not output. So there appear to be two problems here. 1) The assert. 2) The fact the assert message is lost. In our case, simply repeating the command worked without any changes.
Craig Johnson_1
Regular Advisor

Re: Anyone ever seen this?

There was no "assert" error presented in this case. It simply said it was dumping core.

We do see the assertion error if we use the "-k" option.

I do have the core file and will send it to HP. My apologies for not having the output.
Dennis Handly
Acclaimed Contributor

Re: Anyone ever seen this?

>I do have the core file and will send it to HP.

Typically just a corefile is useless. You need to use gdb's packcore command.
John Bigg
Esteemed Contributor

Re: Anyone ever seen this?

Although packcore makes things easier, we can tell the version from the core file and pluck the exe and library from the relevant patch.
Dennis Handly
Acclaimed Contributor

Re: Anyone ever seen this?

>John: pluck the exe and library from the relevant patch.

More power to you then. Are you going to have the right versions for libc, etc?