
Crashdump and HeartBeat with MC/ServiceGuard

 
SOLVED
Tsuyoshi Shiokawa
New Member

Crashdump and HeartBeat with MC/ServiceGuard

Could anybody give me some advice?
My system runs Oracle 9i RAC (Real Application Clusters) with MC/ServiceGuard 11.14 in a two-node cluster. We have two heartbeat LANs connected with crossover cables (1000Base-SX).
When we tested heartbeat failure by disconnecting the heartbeat LANs, Node2 went down.
So far, that seemed fine.
However, we found that /var/adm/crash was broken after Node2 rebooted.
In more detail, we failed to mount /dev/vg00/lvol10 on /var/adm/crash.
We had to run "newfs -F vxfs /dev/vg00/rlvol10", after which the mount succeeded.
We have never faced such a case before.
We ran the same test four times, and the result was the same every time we disconnected the heartbeat LANs.
We cannot find out what is wrong.

Thanks.
Helen French
Honored Contributor

Re: Crashdump and HeartBeat with MC/ServiceGuard

Some thoughts:
1) Is /var/adm/crash a separate FS? Or is it just a directory under the /var file system?
2) Did you check the crash files (crash analysis) in /var/adm/crash? Do they say anything? Is the dump so big that it cannot fit in the FS?
3) Post the exact error message you are getting.
Life is a promise, fulfill it!
Rajeev Shukla
Honored Contributor

Re: Crashdump and HeartBeat with MC/ServiceGuard

There seems to be something wrong with the way you have configured the standby heartbeat network. For some reason it is not failing over, which is why the other node does a TOC and you get a crash dump when you pull out the heartbeat.
Can you post your cluster configuration file so we can see how you have configured the standby heartbeat?

Cheers
Rajeev
Tsuyoshi Shiokawa
New Member

Re: Crashdump and HeartBeat with MC/ServiceGuard

Thank you for the advice.
I'm sorry that my question did not include enough information.
Here are my answers and some more details.

1. Is /var/adm/crash a separate FS?
Yes, it is.
/dev/vg00/lvol9 /var
/dev/vg00/lvol10 /var/adm/crash
and "/var/adm/crash" is the same size as physical memory.

2. Error message
When we checked /etc/rc.log, we found the following messages.

Save system crash dump if needed
Output from "/sbin/rc1.d/S440savecrash start":
----------------------------
savecrash directory not set; defaulting to: /var/adm/crash
savecrash: savecrash running in the background
EXIT CODE: 4 - savecrash proceeding in background
"/sbin/rc1.d/S440savecrash start" FAILED

However, we checked "lvlnboot -v", and it seemed to show no problem.
# lvlnboot -v
Boot Definitions for Volume Group /dev/vg00:
Physical Volumes belonging in Root Volume Group:
/dev/dsk/c1t0d0 (0/0/1/1.0.0) -- Boot Disk
/dev/dsk/c2t0d0 (0/0/2/0.0.0) -- Boot Disk
Boot: lvol1 on: /dev/dsk/c1t0d0
/dev/dsk/c2t0d0
Root: lvol3 on: /dev/dsk/c1t0d0
/dev/dsk/c2t0d0
Swap: lvol2 on: /dev/dsk/c1t0d0
/dev/dsk/c2t0d0
Dump: lvol10 on: /dev/dsk/c1t0d0, 0

3. Cluster configuration script
NODE_NAME node1
NETWORK_INTERFACE lan1
HEARTBEAT_IP 172.16.247.185
NETWORK_INTERFACE lan4
HEARTBEAT_IP 172.16.247.189
NETWORK_INTERFACE lan5
STATIONARY_IP 172.16.247.5

FIRST_CLUSTER_LOCK_PV /dev/dsk/c4t0d0

NODE_NAME node2
NETWORK_INTERFACE lan1
HEARTBEAT_IP 172.16.247.186
NETWORK_INTERFACE lan4
HEARTBEAT_IP 172.16.247.190
NETWORK_INTERFACE lan5
STATIONARY_IP 172.16.247.6
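
One way to sanity-check the heartbeat addressing in a configuration like the one above is to group the HEARTBEAT_IP entries by subnet. The following is only a sketch: the netmask does not appear in the cluster configuration, so the /30 prefix (255.255.255.252) is an assumption, chosen because it matches the network addresses in the "netstat -in" output posted later in this thread. The function name `heartbeat_subnets` is hypothetical.

```python
import ipaddress
from collections import defaultdict

# Fragment copied from the cluster configuration posted above.
CLUSTER_CONFIG = """\
NODE_NAME node1
NETWORK_INTERFACE lan1
HEARTBEAT_IP 172.16.247.185
NETWORK_INTERFACE lan4
HEARTBEAT_IP 172.16.247.189
NETWORK_INTERFACE lan5
STATIONARY_IP 172.16.247.5

FIRST_CLUSTER_LOCK_PV /dev/dsk/c4t0d0

NODE_NAME node2
NETWORK_INTERFACE lan1
HEARTBEAT_IP 172.16.247.186
NETWORK_INTERFACE lan4
HEARTBEAT_IP 172.16.247.190
NETWORK_INTERFACE lan5
STATIONARY_IP 172.16.247.6
"""

# Assumption: the mask is not part of the posted configuration.
ASSUMED_PREFIX = 30


def heartbeat_subnets(config_text, prefix=ASSUMED_PREFIX):
    """Map each heartbeat subnet to the (node, interface, ip) entries on it."""
    subnets = defaultdict(list)
    node = interface = None
    for line in config_text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank lines and anything that is not "KEY value"
        key, value = fields
        if key == "NODE_NAME":
            node = value
        elif key == "NETWORK_INTERFACE":
            interface = value
        elif key == "HEARTBEAT_IP":
            network = ipaddress.ip_interface(f"{value}/{prefix}").network
            subnets[str(network)].append((node, interface, value))
    return dict(subnets)


for subnet, members in heartbeat_subnets(CLUSTER_CONFIG).items():
    print(subnet, members)
```

Under that assumed mask, each heartbeat subnet should contain exactly one interface from each node; a subnet with entries from only one node would suggest a mismatched pair.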

We do not have an alternate data LAN.
Our customer approved this.

As for the crash dump, there were so many files and directories that we could not tell which messages we should post.

Thanks.
melvyn burnard
Honored Contributor

Re: Crashdump and HeartBeat with MC/ServiceGuard

Once more I post our note on using Crossover lan cables in an SG cluster:

We often get questions asking whether Crossover cables are supported for use in a ServiceGuard cluster. The short answer is YES, but there are some important issues that you should be aware of:

This solution only works in a two node cluster. There is no way to have a Standby LAN card when using a Crossover LAN cable.

When either LAN card fails, or the crossover cable is disconnected, both LAN cards go down. This is because the electrical signals necessary for the cards to determine that a valid LAN connection exists are not present. The result is that since both nodes appear to have a bad LAN card, ServiceGuard may TOC the wrong node. If a hub was used between the two LAN cards, then the hub would provide the electrical signals to the other card, allowing it to stay up.
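
The difference between the two wirings described above can be put into a toy model. This is a deliberately simplified sketch of the behaviour in this note, not a description of how ServiceGuard or the LAN drivers actually detect link state; the function `link_status` and the component names (`nic1`, `nic2`, `cable`, `hub`) are hypothetical.

```python
def link_status(topology, failed):
    """Return which LAN cards still see a valid link.

    topology: "crossover" or "hub".
    failed:   set of failed components, e.g. {"nic1"} or {"cable"}.
    """
    nics = ("nic1", "nic2")
    if topology == "crossover":
        # Each card gets its link signal only from the peer card, so any
        # failure on the path takes the link down on both ends.
        link_up = not failed
        return {nic: link_up for nic in nics}
    if topology == "hub":
        # The hub supplies the link signal to each port independently.
        if "hub" in failed:
            return {nic: False for nic in nics}
        return {nic: nic not in failed for nic in nics}
    raise ValueError(f"unknown topology: {topology}")


print(link_status("crossover", {"nic1"}))
print(link_status("hub", {"nic1"}))
```

In the crossover case both cards report link down, so the cluster has no way to tell which node owns the bad card; in the hub case only the failed card goes down.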

On multi-speed cards, such as 10/100Base-T, the cards must negotiate which speed will be used when the system boots up. If only one system is booted and the remote system is down, then the negotiation will fail, and the card will not be enabled at all. So when the second node eventually comes up, its LAN will also be down. If a hub is used, then the negotiation will succeed, so the LAN cards will come up at bootup, even if only one node is running.
It may be possible to force some multi-speed LAN cards to bypass the negotiation at bootup and to use a predetermined fixed speed. If this is possible, then it would allow the two systems to boot up at different times and still use the Crossover cable connected LAN cards once they are both booted up.

Since both cards may go down when there is a failure when a Crossover cable is used, it can be difficult to determine where the problem lies. Another problem with using Crossover cables is that if they are not properly labeled, they may accidentally be used in situations where they will not work.
For the reasons listed above, HP does not recommend using Crossover cables for ServiceGuard configurations. However, they are still supported as long as you are willing to accept the above limitations. Using a Crossover cable is cheaper than using a hub, but it compromises the HA solution.
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
melvyn burnard
Honored Contributor

Re: Crashdump and HeartBeat with MC/ServiceGuard

I also note that you appear to have all three network interfaces on the same subnet.

Please post your netstat -in output.
Also, you say the problem is the same every time. Which problem? The fact that a node crashes? Or that it loses the file system?

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Jean-Louis Phelix
Honored Contributor

Re: Crashdump and HeartBeat with MC/ServiceGuard

Hi,

Your problem is clearly not related to ServiceGuard ... You would have the same disaster with any TOC or PANIC. When you want to get a dump you need to have:

- a dump volume, used as a raw device at dump time to copy the memory contents. This can be a swap device, but that could slow down the reboot, so I prefer to use a dedicated lvol

- a filesystem to receive the image as files at reboot

These devices CAN'T be the same. So each TOC uses your /dev/vg00/lvol10 as a dump device (and destroys the filesystem), and at reboot the dump can't be saved.

I think that you configured /var/adm/crash after running lvlnboot -d; otherwise it would have been refused.

The solution is to create 2 separate lvols or to configure your swap as the dump device.

Hope this helps
It works for me (© Bill McNAMARA ...)
Tsuyoshi Shiokawa
New Member

Re: Crashdump and HeartBeat with MC/ServiceGuard

Thank you for the valuable comments about connecting through a hub.
We have already recommended that the heartbeat be connected through a hub and that an alternate data LAN be prepared.
However, our customer said they could NOT pay for it and would not approve it.

By the way, regarding "netstat -in":
of course, we separate the network segments with subnet masks, as follows.

lan1 and lan4 are the heartbeat LANs.

------ Output of netstat -in (node1) ------

Name   Mtu   Network         Address         Ipkts    Opkts
lan3   1500  172.16.247.176  172.16.247.177  371915   401603
lan2   1500  172.16.247.32   172.16.247.36   623902   550509
lan5:1 1500  172.16.247.0    172.16.247.8    0        0
lan9   1500  172.16.247.180  172.16.247.181  371090   400735
lan1   1500  172.16.247.184  172.16.247.185  587419   532177
lan0   1500  172.16.247.64   172.16.247.78   407238   442279
lo0    4136  127.0.0.0       127.0.0.1       89712    89712
lan5   1500  172.16.247.0    172.16.247.5    1115549  855003
lan4   1500  172.16.247.188  172.16.247.189  587938   532532
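
A quick way to check whether any two physical interfaces share a subnet is to group the rows above by their Network column. This is a sketch over the pasted text only, and it assumes the six-column layout shown here. It also assumes that "lan5:1" is a secondary address on lan5 and counts it as the same physical interface. The function names are hypothetical.

```python
from collections import defaultdict

# Output of "netstat -in" on node1, as posted above.
NETSTAT_IN = """\
Name Mtu Network Address Ipkts Opkts
lan3 1500 172.16.247.176 172.16.247.177 371915 401603
lan2 1500 172.16.247.32 172.16.247.36 623902 550509
lan5:1 1500 172.16.247.0 172.16.247.8 0 0
lan9 1500 172.16.247.180 172.16.247.181 371090 400735
lan1 1500 172.16.247.184 172.16.247.185 587419 532177
lan0 1500 172.16.247.64 172.16.247.78 407238 442279
lo0 4136 127.0.0.0 127.0.0.1 89712 89712
lan5 1500 172.16.247.0 172.16.247.5 1115549 855003
lan4 1500 172.16.247.188 172.16.247.189 587938 532532
"""


def interfaces_by_network(netstat_text):
    """Group physical interface names by the Network column."""
    networks = defaultdict(set)
    for line in netstat_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        name, _mtu, network, _address = fields[:4]
        physical = name.split(":")[0]  # assumption: lan5:1 lives on lan5
        networks[network].add(physical)
    return networks


def shared_networks(netstat_text):
    """Return only the networks used by more than one physical interface."""
    grouped = interfaces_by_network(netstat_text)
    return {net: sorted(ifs) for net, ifs in grouped.items() if len(ifs) > 1}


print(shared_networks(NETSTAT_IN))
```

For the output above, no network is shared by two physical interfaces, which agrees with the statement that lan1 and lan4 sit on separate segments.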

Furthermore, "every time" means
"whenever we pull out the two heartbeat LAN cables".
Sorry for not giving enough information.

Thanks.
Solution

Re: Crashdump and HeartBeat with MC/ServiceGuard

Here's the problem:

Dump: lvol10 on: /dev/dsk/c1t0d0, 0

You have your dump device set to the lvol which you are using for the filesystem /var/adm/crash.

When HP-UX dumps core, it writes the dump to a raw device, NOT to a file system. When the system reboots, the savecrash command takes care of moving the dump off the raw device to a file system (usually /var/adm/crash). What's happening is that when your system TOCs, it writes a crash dump to the raw lvol lvol10, overwriting your file system headers and structure. So every time you get a crash, you're having to recreate the filesystem.
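
The conflict described above can be spotted mechanically by comparing the "Dump:" line of "lvlnboot -v" with the list of mounted logical volumes. The following is a sketch that works on the text pasted earlier in this thread; it does not run any HP-UX command, and the function names and the `FILESYSTEMS` mapping are hypothetical, built from the two mounts quoted in the thread.

```python
import re

# Output of "lvlnboot -v" as posted earlier in this thread.
LVLNBOOT_V = """\
Boot Definitions for Volume Group /dev/vg00:
Physical Volumes belonging in Root Volume Group:
/dev/dsk/c1t0d0 (0/0/1/1.0.0) -- Boot Disk
/dev/dsk/c2t0d0 (0/0/2/0.0.0) -- Boot Disk
Boot: lvol1 on: /dev/dsk/c1t0d0
/dev/dsk/c2t0d0
Root: lvol3 on: /dev/dsk/c1t0d0
/dev/dsk/c2t0d0
Swap: lvol2 on: /dev/dsk/c1t0d0
/dev/dsk/c2t0d0
Dump: lvol10 on: /dev/dsk/c1t0d0, 0
"""

# Mounted logical volumes quoted in the thread (device -> mount point).
FILESYSTEMS = {
    "/dev/vg00/lvol9": "/var",
    "/dev/vg00/lvol10": "/var/adm/crash",
}


def dump_lvols(lvlnboot_text):
    """Return the full device paths of all lvols listed as dump devices."""
    vg_match = re.search(r"Volume Group (\S+):", lvlnboot_text)
    if vg_match is None:
        raise ValueError("no volume group line found")
    vg = vg_match.group(1)
    names = re.findall(r"^Dump:\s+(\S+)\s+on:", lvlnboot_text, re.MULTILINE)
    return [f"{vg}/{name}" for name in names]


def dump_conflicts(lvlnboot_text, filesystems):
    """Return dump lvols that are also mounted as file systems."""
    return {lv: filesystems[lv]
            for lv in dump_lvols(lvlnboot_text) if lv in filesystems}


print(dump_conflicts(LVLNBOOT_V, FILESYSTEMS))
```

An empty result is what you want to see: any entry means a crash dump will be written over a live file system.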

You should tell HP-UX to dump core to a raw device. Most people use swap for this, as there's nothing you need to keep in swap after a reboot, but you can create another raw logical volume to use solely as dump if you really want to. If you do create another device for use as dump, then remember that it MUST be contiguous and have bad block relocation turned OFF (that's the -C y -r N options on lvcreate). To tell HP-UX to use a different dump lvol, use lvrmboot and lvlnboot (you may need to do these in LVM maintenance mode). The following example sets the dump to go to swap, assuming this is a default install with swap in lvol2:

lvrmboot -v -d lvol10 /dev/vg00
lvlnboot -d /dev/vg00/lvol2
lvlnboot -R /dev/vg00

I would also share Melvyn's concerns about your network config and about using X-over cables, particularly as you've gone to the expense of using 9i RAC! A couple of switches/hubs are *very* cheap in comparison.

HTH

Duncan



I am an HPE Employee
Tsuyoshi Shiokawa
New Member

Re: Crashdump and HeartBeat with MC/ServiceGuard

Thank you, everyone. We tried Duncan's solution, and it SUCCEEDED!!
Our team really appreciates it.

Best regards and Thank you.