Operating System - HP-UX

ricardor_1
Frequent Advisor

Oracle 10gR2 RAC + HP-UX 11.11: IPC Timeout

Hello there,

I've been having a rough time with a two-node Oracle 10gR2 (10.2.0.4) RAC (no SGeRAC) running on HP-UX 11.11.

The problem is intermittent and does not seem to be related to database load, but it is very frequent - we even disabled the second instance. When it happens, one (and only one) of the instances aborts with "IPC timeout" errors, which follow below:

IPC Send timeout detected. Receiver ospid 9342
Tue Sep 23 19:30:46 2008
Errors in file /dbs/trace/snoffprd/bdump/snoffprd2_lms1_9342.trc:
Tue Sep 23 19:30:48 2008
Trace dumping is performing id=[cdmp_20080923193048]
Tue Sep 23 19:30:48 2008
Waiting for clusterware split-brain resolution

We have two databases on the same cluster; only one of them suffers from this problem.

Oracle suggested we change from a crossover setup to a gigabit switch, mentioning that crossover interconnects were not supported. The case is still open and they've sent the problem to their development team.

It does not seem to be a physical media problem, since we have two NICs, both of which we tested. We also changed the MTU from 9000 to 1500 and back without success.
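In case it helps anyone, below is roughly the kind of check we left running against the private interconnect while hunting for drops. It's only a sketch: the peer name is a placeholder, and the "-n <count>" ping syntax is the HP-UX form (other platforms use "-c").

#!/usr/bin/env python
# Sketch: ping the other node's private interconnect address every INTERVAL
# seconds and flag any run that loses packets, to catch intermittent media
# problems.  PEER is a placeholder -- use your own interconnect hostname/IP.
# The "-n <count>" form is HP-UX ping syntax; Linux and others use "-c".
import subprocess
import time

PEER = "priv-node2"   # placeholder: private interconnect address of the other node
INTERVAL = 30         # seconds between checks

while True:
    proc = subprocess.Popen(["ping", PEER, "-n", "3"],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)
    out, _ = proc.communicate()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    if proc.returncode != 0:
        print("%s interconnect problem:\n%s" % (stamp, out))
    else:
        print("%s ok" % stamp)
    time.sleep(INTERVAL)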

Have any of you seen anything like this? It has been happening since we migrated the second instance into the cluster. The first one had been running smoothly for several weeks, since it went into production.

Re: Oracle 10gR2 RAC + HP-UX 11.11: IPC Timeout

Well, Oracle support are correct - they *don't* support crossover cables. How can you configure something as critical as a RAC cluster in an unsupported configuration?

Resolve the "supported" issue first; then you can look at this problem if it doesn't go away once you get to a supported configuration.

HTH

Duncan

I am an HPE Employee
ricardor_1
Frequent Advisor

Re: Oracle 10gR2 RAC + HP-UX 11.11: IPC Timeout

Of course we have already installed the gigabit switch interconnect, and the problem persists.

Also, we've been running 9i RAC on the very same setup for several years. There is no sense blaming the crossover setup, since it really doesn't seem to be a physical media problem. Anyway, we're aware crossover might have issues with autonegotiation, so we followed Oracle's advice.

Oracle mentored the whole migration process and none of their personnel complained about the crossover setup.
patrik rybar_1
Frequent Advisor

Re: Oracle 10gR2 RAC + HP-UX 11.11: IPC Timeout

When we had the same situation on Tru64 and 10gR2, everything was solved by adding RAM and CPUs, because all those 'split-brain' issues were caused by load (of course, your situation might be completely different).

Re: Oracle 10gR2 RAC + HP-UX 11.11: IPC Timeout

I didn't say that the crossover cable was the problem - merely that you can't expect a vendor to look at an issue when you're operating in an unsupported configuration. It's good that it's fixed.

I won't comment on Oracle not mentioning the crossover cable during your migration.

So what does CPU utilisation look like at the point in time that you have the issue? Were the systems heavily loaded? (You need to look at both.) You're particularly looking for a lot of sys-mode utilisation.

The problem as I see it with a pure Oracle cluster stack on any platform apart from Linux is that the clusterware operates completely in user space, and as such the subsystems that handle "hung node detection" are always going to be more flaky than those in a product like Serviceguard, which has access to kernel routines for this sort of stuff.

The CRS processes need to get CPU time within certain boundaries to ensure that they can respond to heartbeats etc. from other nodes. This means they usually run at a real-time priority - which can mean they effectively end up tied to just one processor. If that processor is busy doing something in kernel space, then you end up with these sorts of issues. On Linux, Oracle are able to introduce a kernel module (the hangcheck timer) to resolve this, and Serviceguard is able to do something similar on HP-UX.
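If you want to see where those processes actually sit on your boxes, a quick sketch along these lines will pull the relevant lines (PRI and NI columns included) out of ps -efl. The name patterns are only my guesses for a 10gR2 CRS install - adjust them to whatever is actually running:

#!/usr/bin/env python
# Sketch: print the "ps -efl" lines (PRI and NI columns included) for the
# clusterware and LMS/LMD processes.  The name patterns are only guesses for
# a 10gR2 CRS install -- adjust them to match what is actually running.
import re
import subprocess

PATTERN = re.compile(r"crsd|ocssd|evmd|oprocd|_lms|_lmd|_lmon")

proc = subprocess.Popen(["ps", "-efl"], stdout=subprocess.PIPE,
                        universal_newlines=True)
out, _ = proc.communicate()

lines = out.splitlines()
if lines:
    print(lines[0])          # header row, so the PRI/NI columns are easy to spot
for line in lines[1:]:
    if PATTERN.search(line):
        print(line)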

So what to do... Obviously you need to continue to pursue this with Oracle support, as these sorts of issues are very complex (I doubt you'll get a fix on these forums, as we're usually looking at the internals of Oracle CRS and the HP-UX kernel), but in the meantime I would look at bringing myself as up to date as possible on OS patches - particularly kernel patches, as anything that fixes kernel issues which cause large amounts of SYS CPU time could resolve the issue (e.g. spinlock contention).

Apart from that - the suggestion of more CPU/memory is of course a good one as that will reduce the chance of the event happening.

HTH

Duncan

I am an HPE Employee
ricardor_1
Frequent Advisor

Re: Oracle 10gR2 RAC + HP-UX 11.11: IPC Timeout

patrik and Duncan,

Thank you for your answers.

We have observed no clear correlation between the split-brains and the system load. Although this is a rather busy cluster, one of the machines (running 2 databases) is 70% idle on average and the other (on which we disabled the second instance) is now 95% idle.

The CRS processes are running with the default nice (20). We do not know clearly what the effect of lowering it would be, since we were told (although not why, for certain classes of processes like (ora|asm)_lms*) that changing the nice of Oracle processes is not recommended.

We're in contact with Oracle, and I will update this thread with any progress we make.

Thank you again.
ricardor_1
Frequent Advisor

Re: Oracle 10gR2 RAC + HP-UX 11.11: IPC Timeout

Oh, I forgot to mention: we have 5-minute sar statistics and the %sys load is very reasonable - when most of the split-brains occurred, %sys was about 5%.
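For what it's worth, this is roughly how we lined the alert-log events up against the sar data. It's only a sketch: the paths are placeholders for our own files, and it assumes the default HP-UX sar -u column layout (time, %usr, %sys, %wio, %idle).

#!/usr/bin/env python
# Sketch: pull the timestamps of the "IPC Send timeout" events out of the
# alert log and print the sar %sys samples recorded close to each one.
# ALERT_LOG and SAR_FILE are placeholders for our own files, and the parsing
# assumes the default "sar -u" columns: time, %usr, %sys, %wio, %idle.
import time

ALERT_LOG = "/dbs/trace/snoffprd/bdump/alert_snoffprd2.log"   # placeholder
SAR_FILE  = "/tmp/sar_u.txt"   # e.g. saved output of "sar -u -f /var/adm/sa/saDD"
WINDOW    = 10 * 60            # look at samples within 10 minutes of an event

def seconds_of_day(t):
    return t.tm_hour * 3600 + t.tm_min * 60 + t.tm_sec

# 1. Collect the timestamp preceding each IPC timeout message in the alert log.
events = []
last_stamp = None
for line in open(ALERT_LOG):
    line = line.strip()
    try:
        last_stamp = time.strptime(line, "%a %b %d %H:%M:%S %Y")
    except ValueError:
        pass
    if "IPC Send timeout" in line and last_stamp is not None:
        events.append(last_stamp)

# 2. Print the sar samples that fall close to each event (same day assumed).
for line in open(SAR_FILE):
    fields = line.split()
    if len(fields) < 5:
        continue
    try:
        sample = time.strptime(fields[0], "%H:%M:%S")
    except ValueError:
        continue
    for ev in events:
        if abs(seconds_of_day(sample) - seconds_of_day(ev)) <= WINDOW:
            print("event %s  sar %s  %%sys=%s" %
                  (time.strftime("%H:%M:%S", ev), fields[0], fields[2]))
            break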
ricardor_1
Frequent Advisor

Re: Oracle 10gR2 RAC + HP-UX 11.11: IPC Timeout

Having thought about it for a while, there's another important point that's probably not very clear in my messages above.

We have 2 databases on both machines (4 instances). The two databases are rather balanced in terms of load. Only one of them crashes; the other keeps running smoothly. This is not a cluster-wide split-brain.