Topic: Problem with Cluster (Operating System - OpenVMS)

Problem with Cluster

Vladimir Fabecic — Tue, 17 Oct 2006 10:19:46 GMT

Hello
Configuration: 2-node cluster (ES47, ES45), OS version 8.2, 2 x MA8000 storage, Gigabit Ethernet cluster interconnect.
The first node (ES47: 4 CPUs, 24 GB RAM) is running 13 Oracle instances.
The problem is that this machine hangs at about 9:00 AM, when the workload peaks. The second node is currently doing nothing (some databases have not been created yet) and is working fine.
After the node is rebooted, it works fine until 9:00 AM the next day (just four instances are active 24 hours a day).
Before the cluster was created, this machine worked fine standalone. It even worked OK for one day as a single-node cluster (before the second node was added). As a single node it worked with 10 HSG disks; now it is working with 28 HSG disks.
There is no crash dump file and nothing in operator.log.
Where should I start debugging?
Which system parameters should be checked?

Re: Problem with Cluster

Robert Gezelter — Tue, 17 Oct 2006 10:24:49 GMT

Vladimir,

My first question is: A total freeze, or Oracle and the applications freeze (and you retain access from terminal windows)?

I would recommend running T4 (see the OpenVMS www site) and collecting and analyzing the resulting data. You could be running out of something, but there are many possibilities.

Also, I would consider if I can force a crash dump manually. Analysis of the dump file should show what is hung on what (presuming that it is an OpenVMS problem and not a problem within the application or Oracle).

- Bob Gezelter, http://www.rlgsc.com

Re: Problem with Cluster

Vladimir Fabecic — Tue, 17 Oct 2006 10:51:59 GMT

Hello Bob
Thanks for the fast reply. It is not a total freeze. It even opens a new terminal in X but does not give a $ prompt. It looks like it is running out of something. Since it is a production environment, I must react fast. I will do some monitoring tomorrow. I know there are many possibilities, but what would be your first guess?

Re: Problem with Cluster

Volker Halle — Tue, 17 Oct 2006 10:56:20 GMT

Vladimir,

same questions as Bob: what is 'hanging' ?

- can you still do a PING node ?
- can you login via TELNET, LAT, DECnet ?
- how do you 'reboot' that node (just hitting restart-switch) ?
- what does a SHO SYS/NODE=xxx show if issued from the other node when the first one is 'hung' ? Any processes in RW* state ?

Volker.
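
A minimal sketch of the last check above, run from the node that is still responsive. The node name ES47 is a placeholder, and filtering through PIPE and SEARCH is just one way to pick out the interesting lines:

```dcl
$ ! Run on the healthy node; ES47 stands for the hung node's name
$ SHOW SYSTEM/NODE=ES47
$ ! Keep only lines mentioning resource-wait (RWxxx) or MUTEX states
$ PIPE SHOW SYSTEM/NODE=ES47 | SEARCH SYS$INPUT "RW","MUTEX"
```

The text filter may also match a process name that happens to contain "RW", so the full listing remains the reference.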

Re: Problem with Cluster

Robert Gezelter — Tue, 17 Oct 2006 10:58:47 GMT

Vladimir,

Locks, pool, and various quotas come to mind.

Getting a fairly comprehensive T4 output would be helpful.

I would also consider whether the problem gives warnings in the hour or so before the freeze actually happens. I would also check whether somebody is running some automated process at or about the time of the freeze. I would also hook up one or more network sniffers to the applicable network connections to monitor traffic to/from the node (Wireshark, the successor to Ethereal, is available as a free download, so having multiple monitors should not be a problem).

- Bob Gezelter, http://www.rlgsc.com

Re: Problem with Cluster

Volker Halle — Tue, 17 Oct 2006 11:01:56 GMT

Vladimir,

if there is nothing in OPERATOR.LOG, please also watch the console terminal for any messages (Mount-verification ?)

If you don't get a $ prompt, you're likely to hit the RESTART button to reboot your system. Try HALT button and >>> CRASH instead. It will take some time to write the dump, but that will probably be the only way to find out what's wrong.

Try logging in using Username: xxx/NOCOMMAND to skip your login-procedures, they may hang due to some problem.

Try to keep a terminal logged in before 09:00 AM to be able to look around once the problem hits.

Volker.
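
A rough sketch of the forced-crash route described above. It assumes the dump is written to the default SYS$SYSTEM:SYSDUMP.DMP, which may not match this system's dump setup, and the SDA commands are only a starting set:

```dcl
$ ! At the console, after pressing HALT on the hung node:
$ !     >>> CRASH
$ ! Once the dump is written and the node has rebooted:
$ ANALYZE/CRASH_DUMP SYS$SYSTEM:SYSDUMP.DMP
SDA> SHOW CRASH
SDA> SHOW SUMMARY
SDA> SHOW POOL/SUMMARY
SDA> EXIT
```

SHOW SUMMARY lists every process with its state, which is the quickest way to see what was waiting when the system hung.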

Re: Problem with Cluster

Andy Bustamante — Tue, 17 Oct 2006 11:11:58 GMT

Are there any console messages displayed? Assuming a quorum disk, is there any production I/O on the quorum disk?

You state gigabit ethernet cluster interconnect, is this dedicated to cluster traffic or does it share application traffic and cluster traffic?

There was similar behavior with 7.3-2 and TCPIP 5.4, corrected in ECO4 if I recall correctly. Are you current with TCPIP ECOs? This may or may not be an issue in TCPIP 5.5 included with VMS 8.2.

Andy

Re: Problem with Cluster

Robert Gezelter — Tue, 17 Oct 2006 11:12:41 GMT

Vladimir,

Also consider opening a SYSMAN session on the other cluster node, with a SET ENVIRONMENT to the node that is failing.

I have seen situations where terminal sessions were useless, but the SYSMAN session remained usable.

- Bob Gezelter, http://www.rlgsc.com
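
A sketch of such a session, started on the healthy node ahead of the expected freeze. ES47 is a placeholder node name, and the DO commands are examples only:

```dcl
$ RUN SYS$SYSTEM:SYSMAN
SYSMAN> SET ENVIRONMENT/NODE=ES47
SYSMAN> DO SHOW SYSTEM
SYSMAN> DO SHOW MEMORY
SYSMAN> EXIT
```

Each DO command is executed on the target node and its output is returned to the local session.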

Re: Problem with Cluster

Vladimir Fabecic — Tue, 17 Oct 2006 11:45:25 GMT

PING works. Telnet gives a timeout. 'Reboot' is done with the restart switch. The console terminal was not turned on, so there are no messages.
I got this information from the customer.

The Gigabit Ethernet cluster interconnect is dedicated to cluster traffic. There is no production I/O on the quorum disk. All the newest patches are installed, including TCPIP 5.5 ECO1.
Tomorrow I will do some monitoring as suggested by Volker and Bob.
If there is no other way, I will force a crash the day after.
The terminal will be connected to a Reflection session so everything will be logged.
I do not think it is an Oracle problem because nothing has been changed in the Oracle software.
It looks like a parameter (or quota) problem to me, but I will have much more information tomorrow.
Guys, thanks a lot for helping me.

Re: Problem with Cluster

labadie_1 — Tue, 17 Oct 2006 11:52:03 GMT

From the node that is still working, do a
sh sys/node=other, to see if you have many processes in "interesting" states (rwxxx, mutex...).

As said before, try a
mc sysman set env/node=other
do any command

If a login fails after the username, this can mean pagedyn is too low.

Take a crash, you will have something to analyse

The best advice: install AMDS or Availability Manager, and you will have all the good data available to know what is going wrong.

Re: Problem with Cluster

Volker Halle — Tue, 17 Oct 2006 12:28:27 GMT

Vladimir,

if PING works, but TELNET gives a timeout, could it be a process creation/scheduling problem ? A high PRIO looping job preventing any other processes from receiving any CPU time ?

Volker.

Re: Problem with Cluster

labadie_1 — Tue, 17 Oct 2006 14:26:20 GMT

after the reboot, check in the (previous) operator.log messages such as
pagefrag
pagecrit
noslot
no pcb available

Good hunt
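
One possible way to run that check, assuming the operator log is in the usual SYS$MANAGER location and that the previous version (;-1) is the one from before the reboot, which depends on how often the log is restarted:

```dcl
$ ! ;-1 selects the version before the current one; adjust as needed
$ SEARCH SYS$MANAGER:OPERATOR.LOG;-1 "PAGEFRAG","PAGECRIT","NOSLOT","PCB"
```

SEARCH matches any of the listed strings and ignores case by default, so "PCB" also catches "no pcb available".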

Re: Problem with Cluster

Albert Öttl — Tue, 17 Oct 2006 14:46:09 GMT

Hi Vladimir,
did you consider a lock tree remastering?

What are the values for the SYSGEN parameters
LOCKDIRWT and PE1 on both nodes?

Regards,
Albert
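
The two values can be read on each node roughly as follows. The MONITOR classes are an addition here, as one way to watch lock activity while the workload ramps up:

```dcl
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> SHOW LOCKDIRWT
SYSGEN> SHOW PE1
SYSGEN> EXIT
$ ! Lock management and distributed lock traffic
$ MONITOR LOCK
$ MONITOR DLOCK
```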

Re: Problem with Cluster

EdgarZamora_1 — Tue, 17 Oct 2006 15:56:17 GMT

Had a similar problem. Check your locks. HP changed some memory locking stuff in 8.2. If your locking rate has become excessive, install SYS500 and UPDATE400. There is a patch that reverts the behavior back to 7.3-2 for locking pages.

Hope that helps!

Re: Problem with Cluster

Joseph Huber_1 — Wed, 18 Oct 2006 02:28:27 GMT

In addition to all the monitoring stuff, did You do a simple AUTOGEN with feedback since the upgrade to more than double the number of disks/HSGs ?
It may show some of the parameters to adjust.
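
A sketch of a feedback run that only produces a report and changes nothing, assuming the system has been up under typical load long enough for the feedback data to be meaningful:

```dcl
$ ! Stop at TESTFILES: parameters are calculated and reported, not applied
$ @SYS$UPDATE:AUTOGEN SAVPARAMS TESTFILES FEEDBACK
$ TYPE/PAGE SYS$SYSTEM:AGEN$PARAMS.REPORT
```

The report lists the parameters AUTOGEN would change and the feedback it based them on. It does not necessarily cover every parameter.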

Re: Problem with Cluster

Jan van den Ende — Wed, 18 Oct 2006 03:00:26 GMT

Vladimir,

what is the 'typical' behavior?
Gradual slowdown till things stop, or all going normal until 'sudden death'?

So many questions asked already, I guess the right one is there, but you need facts to decide which one.
I like Joseph's AUTOGEN idea. It could give a lot of info, even before you run stuck again.

Good hunting!

Proost.

Have one on me.

jpe

Re: Problem with Cluster

Vladimir Fabecic — Wed, 18 Oct 2006 05:44:48 GMT

Hello guys
I think I found the cause of the problem. Yesterday I doubled the CHANNELCNT parameter and everything has been working fine so far.

Some answers to the questions:
what is the 'typical' behavior?
Everything was going normally until 'sudden death'.
did you do a simple AUTOGEN with feedback ?
I did. It reported nothing about CHANNELCNT.
LOCKDIRWT and PE1 are set to 0 on both nodes.

I will try to schedule some downtime to increase NPAGEDYN and NPAGEVIR because of the new database instance.

Again, thanks a lot for your help and time.
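
For reference, a sketch of how such a change is usually made permanent through MODPARAMS.DAT so that a later AUTOGEN run does not undo it. The value 512 is a placeholder and is not taken from this thread:

```dcl
$ ! Current value of the parameter
$ WRITE SYS$OUTPUT F$GETSYI("CHANNELCNT")
$ !
$ ! Line to add to SYS$SYSTEM:MODPARAMS.DAT (placeholder value):
$ !     MIN_CHANNELCNT = 512
$ !
$ ! Recalculate and set parameters from MODPARAMS.DAT
$ @SYS$UPDATE:AUTOGEN GETDATA SETPARAMS NOFEEDBACK
```

The MIN_ prefix sets a floor rather than a fixed value, which leaves AUTOGEN free to go higher. Parameters that are not dynamic take effect only at the next reboot.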

Re: Problem with Cluster

Robert Gezelter — Wed, 18 Oct 2006 06:07:32 GMT

Vladimir,

At first glance, that would certainly appear to be able to produce the symptoms that you described.

I would also recommend checking other parameters that may be close to a problem area. It is hard to come up with a solid rule, but I would take a look at everything that is over 50-60% used (since presumably, this is to become a two-node cluster).

- Bob Gezelter, http://www.rlgsc.com
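
A few standard displays that could serve as a starting point for that kind of review. Which resources matter most on this system is an open question, so this is only a sample:

```dcl
$ ! Nonpaged and paged pool: total size, free space, largest free block
$ SHOW MEMORY/POOL/FULL
$ ! Process entry slots and balance set slots in use
$ SHOW MEMORY/SLOTS
$ ! Page and swap file usage
$ SHOW MEMORY/FILES
```

Running these near the daily peak, rather than right after a reboot, gives a more realistic picture of how close each resource is to its limit.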