HEGP
Occasional Contributor

IP routing issues with ServiceGuard on HP-UX 11iv3

Hello folks,

 

I would like to submit an issue we have on our newest ServiceGuard cluster running on a pair of BL860 i2 blades.

This has been reported to HP but so far very little progress has been made. Since I really can't see what's so specific about our configuration that makes us have this problem, I would suppose that others have been bitten by this too, and I would love to hear what you've done to work around it.

Sorry, this is going to be a lengthy post because I need to provide all the relevant information. Please bear with me and many thanks in advance for those who will take the time to read it.

Any information you can provide on these issues, including "we have this too", is more than welcome.

 

-----------------------------------------------------------------------------
Problem 1: we need to explicitly and manually set a default route for every ServiceGuard package running on the machine (except those packaging SRP containers) to maintain its network connectivity with other VLANs.
-----------------------------------------------------------------------------
A bit of background information first: these servers have 3 active LAN interfaces per server:
- lan2 in network 10.149.160.0/24: production traffic
- lan4 in network 10.149.247.0/24: administrative traffic
- lan5 in network 192.168.2.0/24: ServiceGuard heartbeat traffic
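In netconf terms the setup looks roughly like this (the .10 host addresses below are placeholders, not our real ones):

# /etc/rc.config.d/netconf, hypothetical host addresses
INTERFACE_NAME[0]="lan2"
IP_ADDRESS[0]="10.149.160.10"
SUBNET_MASK[0]="255.255.255.0"
INTERFACE_NAME[1]="lan4"
IP_ADDRESS[1]="10.149.247.10"
SUBNET_MASK[1]="255.255.255.0"
INTERFACE_NAME[2]="lan5"
IP_ADDRESS[2]="192.168.2.10"
SUBNET_MASK[2]="255.255.255.0"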

Initially, only one default gateway (10.149.160.254, hence on interface lan2) was defined in /etc/rc.config.d/netconf.
ServiceGuard packages running on this machine have IP addresses in network 10.149.160.0/24 hence they come up as secondary lan2:X interfaces.

Such a configuration makes the administrative IP address 10.149.247.X unreachable from any of our IP networks (we have 10.149.0.0/16) except from 10.149.247.0/24 itself, unless we force ip_strong_es_model=2 in the network stack parameters, which is not the default configuration.
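For the record, here is what that change looks like; a minimal sketch, with the runtime ndd command plus the /etc/rc.config.d/nddconf entry that makes it persistent (index 0 assumes nddconf has no other entries):

# runtime change, lost at reboot
ndd -set /dev/ip ip_strong_es_model 2

# /etc/rc.config.d/nddconf entry for persistence
TRANSPORT_NAME[0]=ip
NDD_NAME[0]=ip_strong_es_model
NDD_VALUE[0]=2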

We used to do this on our two older clusters, but this one is supposed to host SRP containers, which *require* ip_strong_es_model=1, so that's not an option.
Therefore we have added a second default gateway, 10.149.247.254, to the network boot configuration (/etc/rc.config.d/netconf). This solves the network connectivity for the administrative (lan4) IP.
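For reference, the added entry looks roughly like this (index 1 assumes the original 10.149.160.254 default route is entry 0):

ROUTE_DESTINATION[1]="default"
ROUTE_MASK[1]=""
ROUTE_GATEWAY[1]="10.149.247.254"
ROUTE_COUNT[1]="1"
ROUTE_ARGS[1]=""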
The heartbeat interface (lan5) is obviously not affected; it doesn't get any traffic from outside its own private network.

However, we have noticed that:
- SG packages running a SRP container work fine (no IP connectivity issue)
- "plain" SG packages (e.g. running an Oracle DB engine) are unreachable from any other network than 10.149.160.0/24. We need to explicitly add a separate default gateway for each and every package.

For example: assume a package whose IP address is 10.149.160.123, which comes up as lan2:1. Full IP connectivity can only be achieved if the following command is issued during package startup:
/usr/sbin/route add default 10.149.160.254 1 source 10.149.160.123
The new default route appears as follows in "netstat -rn":
default               10.149.160.254     UG    0    lan2:1     1500

The "normal" SG script that takes care of bringing up the package's IP address (namely /etc/cmcluster/scripts/sg/package_ip.sh) DOES NOT do this. However the additional scripts run when a package hosting a SRP container is started (/etc/cmcluster/package_name/srp_route_script) DOES do it. Therefore someone at must HP have realized that this was required, but why hasn't this been backported to the main SG scripts?

We eventually resorted to patching package_ip.sh to add the needed default gateway at package startup and remove it at package shutdown, but this really is an ugly hack I'm not proud of.
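For what it's worth, the same effect can be achieved without touching package_ip.sh when the package is modular and has an external_script configured. A minimal sketch, using the example address and gateway from above as placeholders (one copy per package; I'm also assuming route delete accepts the same source argument as route add does on 11iv3):

#!/usr/bin/sh
# Hypothetical Serviceguard external script fragment: add the per-package
# default route at package startup, remove it at shutdown.
PKG_IP=10.149.160.123     # example package address from above
GATEWAY=10.149.160.254    # default gateway on the production VLAN

case "$1" in
start)
    /usr/sbin/route add default ${GATEWAY} 1 source ${PKG_IP}
    ;;
stop)
    /usr/sbin/route delete default ${GATEWAY} 1 source ${PKG_IP}
    ;;
esac
exit 0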


-----------------------------------------------------------------------------
Problem 2: outgoing connections from application processes within SG packages (including ones made by standard HP-UX commands such as remsh) have completely unpredictable source IP addresses
-----------------------------------------------------------------------------
This is an entirely new issue; no such behaviour has ever taken place on the two other BL860 clusters running HP-UX 11iv3. We observe that outgoing connections made from processes running within SG packages have unpredictable, changing source IP addresses. Since all packages have IP addresses within 10.149.160.0/24, we would expect the source IP address to be the one set on lan2 at boot. This CERTAINLY was the case on the older machines.
We observe that these source addresses can be ANY of the addresses set on lan2, i.e. the address of any active SG package, and they vary over time too. Starting a SG package tends to make the source IP address of outgoing connections made by any process running on this machine "stick" to the address of the newly started package... until another one is started.
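A crude way to watch this live, assuming 10.149.160.50 stands in for some host on the production VLAN (hypothetical address): keep a remsh session open and check which local address the stack picked for it.

remsh 10.149.160.50 sleep 30 &
netstat -an | grep '10.149.160.50.514'
# the local address column can show any of the lan2 aliases,
# not necessarily the primary lan2 address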

This makes tasks such as maintaining .rhosts files on remote target servers that receive remsh or rcp commands from a SG package completely impossible, since we would have to account for every possible package address showing up as the source IP of the connections they receive.

The only reply we've had so far from HP ("application processes need to be bound to their socket by their IP address only and not by ANY address") is completely unacceptable, for many reasons:
- we're not going to hardcode the IP address assigned to the relevant SG package into the source of any of our applications
- for several applications, the source code is no longer available or never was, and/or they're HP-PA applications running under ARIES
- this also affects standard HP-UX commands such as remsh, rcp, ftp etc. We're not going to recode these, are we?

 

Thanks for your time reading this,

Greets,

_Alain_

Laurent Menase
Honored Contributor

Re: IP routing issues with ServiceGuard on HP-UX 11iv3

For 1)

The DB engine should run in a container.

HP-UX Containers (SRP) A.03.01 Administrator's Guide

 

1.6 Global view
When you enable HP-UX Containers on a system, all processes not executing within a container execute in the global view. Sessions logging in via the system console, or connecting via telnet or SSH to server IP addresses not assigned to any container, will execute in the global view.
The global view has no access restrictions and therefore can view and manage processes in the global view and all containers. System administration activity that you must perform in the global view includes installing Software Distributor (SD) packaged software, device management, network interface management, setting kernel tunables, and executing system management utilities such as smh(1M) and srp(1M). You can perform file backup and recovery in either the global view or from within the individual containers.
As the global view has unrestricted access to system resources, HP recommends that you use the global view for system management activities. Hosting of general purpose application workload should occur within a container.
 
For 2) it was already the case without SRP, but the probability was lower.
HEGP
Occasional Contributor

Re: IP routing issues with ServiceGuard on HP-UX 11iv3

Thanks for your replies,

 

@Stephen: have you actually read this document? It deals with the creation of a SRP package and Oracle configuration, but says little, if anything, about the IP routing issues.

 

@Laurent: you seem to imply that you can't have a mix of "regular" SG packages and SG-packaged SRP containers on the same server. Where do you get this information from? Our HP support folks certainly haven't complained the slightest bit about us having both on the same box...

As for the problem already being present, although less likely, before SRP packages: in that case there must be orders of magnitude of difference. Our SG packages have been running for quite a few years on the two other clusters with ip_strong_es_model=2, making hundreds if not thousands of connections per day, and we have *never* encountered such a problem. The source IP address of these outgoing connections has always been predictable and equal to the IP address of the server itself on that interface (in my example, the IP address of lan2).

Now it tends to take the IP address of the last package started on the server, many times per day.

 

This problem is already known to HP because, as I've been told, SG daemons themselves can be affected. Sometimes intra-cluster connections made by these daemons are rejected by the target node because the source IP is not recognized as one belonging to the source node. This can cause host panics due to safety timer expiration. A local HP support engineer told me that (quote) "an upcoming release of Serviceguard will address the issue by forcing daemons to bind their sockets explicitly to the native IP address of the server instead of to 'any'" (end quote).

 

This document covers the routing issues quite nicely, although it doesn't provide a solution.