Operating System - OpenVMS

Vladimir Fabecic
Honored Contributor

Problem with Cluster

Hello
Configuration: 2-node cluster (ES47, ES45), OpenVMS version 8.2, 2x MA8000 storage, Gigabit Ethernet cluster interconnect.
First node (ES47: 4 CPU 24 GB RAM) is running 13 Oracle instances.
The problem is that this machine hangs at about 9:00 AM, when the workload reaches its peak. The second node is currently doing nothing (some databases have not been created yet) and is working fine.
When this node is rebooted, it is working fine until next day 9:00 AM (just four instances are active 24 hours a day).
Before the cluster was created, this machine worked fine as a standalone system. It even worked OK for one day as a single-node cluster (before the second node was added). As a single node it worked with 10 HSG disks; now it is working with 28 HSG disks.
No crash dump file, nothing in operator.log.
Where should I start debugging?
Which system parameters should be checked?
In vino veritas, in VMS cluster
17 REPLIES
Robert Gezelter
Honored Contributor

Re: Problem with Cluster

Vladimir,

My first question is: A total freeze, or Oracle and the applications freeze (and you retain access from terminal windows)?

I would recommend running T4 (see the OpenVMS www site) and collecting and analyzing the resulting data. You could be running out of something, but there are many possibilities.

Also, I would consider if I can force a crash dump manually. Analysis of the dump file should show what is hung on what (presuming that it is an OpenVMS problem and not a problem within the application or Oracle).

- Bob Gezelter, http://www.rlgsc.com
Vladimir Fabecic
Honored Contributor

Re: Problem with Cluster

Hello Bob
Thanks for the fast reply. It is not a total freeze. It even opens a new terminal in X, but does not give a $ prompt. It looks like it is running out of something. Since it is a production environment, I must react fast. I will do some monitoring tomorrow. I know there are many possibilities, but what would be your first guess?
In vino veritas, in VMS cluster
Volker Halle
Honored Contributor

Re: Problem with Cluster

Vladimir,

Same question as Bob: what is 'hanging'?

- can you still do a PING to the node?
- can you log in via TELNET, LAT, DECnet?
- how do you 'reboot' that node (just hitting the restart switch)?
- what does a SHO SYS/NODE=xxx show if issued from the other node when the first one is 'hung'? Any processes in RW* state?

Volker.
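
The RW* check in the last question can be scripted once the SHO SYS/NODE output has been captured to a text file on another machine. This is only a minimal sketch in Python, assuming the usual fixed-width SHOW SYSTEM layout (8-digit hex PID, one space, a 15-character process name, then the state column). The function name and the column widths are assumptions made for illustration and should be checked against real output before relying on them.

```python
import re

# Assumed row layout: 8 hex digits (PID), one space, a process name padded
# to 15 characters (names may contain spaces), whitespace, then the state.
ROW_RE = re.compile(
    r"^(?P<pid>[0-9A-Fa-f]{8}) (?P<name>.{15})\s+(?P<state>[A-Z]+)"
)


def suspicious_processes(show_system_text):
    """Return (pid, name, state) for every process in an RW* or MUTEX state."""
    hits = []
    for line in show_system_text.splitlines():
        match = ROW_RE.match(line)
        if match is None:
            # Banner, column header, or blank line: not a process row.
            continue
        state = match.group("state")
        if state.startswith("RW") or state == "MUTEX":
            hits.append(
                (match.group("pid"), match.group("name").strip(), state)
            )
    return hits
```

Passing it the captured text of one SHO SYS/NODE=xxx run returns only the processes stuck in resource-wait or mutex states, which makes it easier to compare several snapshots taken in the minutes before 09:00.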

Robert Gezelter
Honored Contributor
Solution

Re: Problem with Cluster

Vladimir,

Locks, pool, and various quotas come to mind.

Getting a fairly comprehensive T4 output would be helpful.

I would also consider whether the problem gives warnings in the hour or so before the freeze actually happens. I would also check whether somebody is running some automated process at or about the time of the freeze. I would also hook up one or more network sniffers to the applicable network connections to monitor traffic to/from the node (Wireshark, the successor to Ethereal, is available as a free download, so having multiple monitors should not be a problem).

- Bob Gezelter, http://www.rlgsc.com
Volker Halle
Honored Contributor

Re: Problem with Cluster

Vladimir,

if there is nothing in OPERATOR.LOG, please also watch the console terminal for any messages (Mount-verification ?)

If you don't get a $ prompt, you're likely to hit the RESTART button to reboot your system. Try HALT button and >>> CRASH instead. It will take some time to write the dump, but that will probably be the only way to find out what's wrong.

Try logging in using Username: xxx/NOCOMMAND to skip your login-procedures, they may hang due to some problem.

Try to keep a terminal logged in before 09:00 AM to be able to look around once the problem hits.

Volker.
Andy Bustamante
Honored Contributor

Re: Problem with Cluster


Are there any console messages displayed? Assuming a quorum disk, is there any production I/O on the quorum disk?

You state gigabit ethernet cluster interconnect, is this dedicated to cluster traffic or does it share application traffic and cluster traffic?

There was similar behavior with 7.3-2 and TCPIP 5.4, corrected in ECO4 if I recall correctly. Are you current with TCPIP ECOs? This may or may not be an issue in TCPIP 5.5, included with VMS 8.2.

Andy
If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net
Robert Gezelter
Honored Contributor

Re: Problem with Cluster

Vladimir,

Also consider opening a SYSMAN session on the other cluster node, with a SET ENVIRONMENT to the node that is failing.

I have seen situations where terminal sessions were useless, but the SYSMAN session remained usable.

- Bob Gezelter, http://www.rlgsc.com
Vladimir Fabecic
Honored Contributor

Re: Problem with Cluster

PING works. Telnet gives a timeout. The 'reboot' is done with the restart switch. The console terminal was not turned on, so there are no messages.
I got this information from the customer.

Gigabit ethernet cluster interconnect is dedicated to cluster traffic. There is no production I/O on the quorum disk. All newest patches are installed including TCPIP 5.5 ECO1.
Tomorrow I will do some monitoring as suggested by Volker and Bob.
If there is no other way, I will force a crash the day after.
The terminal will be connected to a Reflection session, so everything will be logged.
I do not think it is an Oracle problem, because nothing has been changed in the Oracle software.
It looks like a parameter (or quota) problem to me, but I will have much more information tomorrow.
Guys, thanks a lot for helping me.
In vino veritas, in VMS cluster
labadie_1
Honored Contributor

Re: Problem with Cluster

From the node that is still working, do a
sh sys/node=other
to see whether many processes are in "interesting" states (RWxxx, MUTEX, ...).

As said before, try
mc sysman
SYSMAN> set env/node=other
SYSMAN> do <any command>

If a login fails after the username, this can mean pagedyn is too low.

Take a crash, and you will have something to analyse.

The best advice: install AMDS or Availability Manager, and you will have all the data you need to find out what is going wrong.