Problem with Cluster
10-17-2006 03:19 AM
Configuration: two-node cluster (ES47, ES45), OpenVMS 8.2, two MA8000 storage systems, Gigabit Ethernet cluster interconnect.
The first node (ES47: 4 CPUs, 24 GB RAM) is running 13 Oracle instances.
The problem is that this machine hangs at about 9:00 AM, when the workload reaches its peak. The second node is currently doing nothing (some databases have not been created yet) and is working fine.
After the node is rebooted, it works fine until 9:00 AM the next day (only four instances are active 24 hours a day).
Before the cluster was created, this machine worked fine as a standalone system. It even worked fine for one day as a single-node cluster (before the second node was added). As a single node it worked with 10 HSG disks; now it is working with 28 HSG disks.
No crash dump file, nothing in operator.log.
Where should I start debugging?
Which system parameters should be checked?
10-17-2006 03:24 AM
Re: Problem with Cluster
My first question is: A total freeze, or Oracle and the applications freeze (and you retain access from terminal windows)?
I would recommend running T4 (see the OpenVMS www site) and collecting and analyzing the resulting data. You could be running out of something, but there are many possibilities.
Also, I would consider if I can force a crash dump manually. Analysis of the dump file should show what is hung on what (presuming that it is an OpenVMS problem and not a problem within the application or Oracle).
- Bob Gezelter, http://www.rlgsc.com
10-17-2006 03:51 AM
Re: Problem with Cluster
Thanks for the fast reply. It is not a total freeze. It even opens a new terminal in X, but does not give a $ prompt. It looks like it is running out of something. Since it is a production environment, I must react fast. I will do some monitoring tomorrow. I know there are many possibilities, but what would be your first guess?
10-17-2006 03:56 AM
Re: Problem with Cluster
Same questions as Bob: what is 'hanging'?
- Can you still PING the node?
- Can you log in via TELNET, LAT, or DECnet?
- How do you 'reboot' that node (just hitting the restart switch)?
- What does SHO SYS/NODE=xxx show if issued from the other node when the first one is 'hung'? Any processes in RW* state?
Volker.
10-17-2006 03:58 AM
Solution
Locks, pool, and various quotas come to mind.
Getting a fairly comprehensive T4 output would be helpful.
I would also consider whether the problem gives warnings in the hour or so before the freeze actually happens. I would also check whether somebody is running some automated process at or about the time of the freeze. I would also hook up one or more network sniffers to the applicable network connections to monitor traffic to/from the node (Wireshark, the successor to Ethereal, is available as a free download, so having multiple monitors should not be a problem).
- Bob Gezelter, http://www.rlgsc.com
10-17-2006 04:01 AM
Re: Problem with Cluster
If there is nothing in OPERATOR.LOG, please also watch the console terminal for any messages (mount verification?).
If you don't get a $ prompt, you're likely to hit the RESTART button to reboot your system. Try the HALT button and >>> CRASH instead. It will take some time to write the dump, but that will probably be the only way to find out what's wrong.
Try logging in using Username: xxx/NOCOMMAND to skip your login procedures; they may hang due to some problem.
Try to keep a terminal logged in before 09:00 AM to be able to look around once the problem hits.
Volker.
10-17-2006 04:11 AM
Re: Problem with Cluster
Are there any console messages displayed? Assuming a quorum disk, is there any production I/O on the quorum disk?
You state that the cluster interconnect is Gigabit Ethernet. Is it dedicated to cluster traffic, or does it carry both application and cluster traffic?
There was similar behavior with 7.3-2 and TCPIP 5.4, corrected in ECO4 if I recall correctly. Are you current with TCPIP ECOs? This may or may not be an issue in TCPIP 5.5 included with VMS 8.2.
Andy
10-17-2006 04:12 AM
Re: Problem with Cluster
Also consider opening a SYSMAN session on the other cluster node, with a SET ENVIRONMENT to the node that is failing.
I have seen situations where terminal sessions were useless, but the SYSMAN session remained usable.
- Bob Gezelter, http://www.rlgsc.com
10-17-2006 04:45 AM
Re: Problem with Cluster
I got this information from the customer.
The Gigabit Ethernet cluster interconnect is dedicated to cluster traffic. There is no production I/O on the quorum disk. All the newest patches are installed, including TCPIP 5.5 ECO1.
Tomorrow I will do some monitoring as suggested by Volker and Bob.
If there is no other way, I will force a crash the day after.
The terminal will be connected to a Reflection session, so everything will be logged.
I do not think it is an Oracle problem, because nothing has been changed in the Oracle software.
It looks like a parameter (or quota) problem to me, but I will have much more information tomorrow.
Guys, thanks a lot for helping me.
10-17-2006 04:52 AM
Re: Problem with Cluster
Run sh sys/node=other to see whether many processes are in "interesting" states (RWxxx, MUTEX...).
As said before, try:
mc sysman set env/node=other
do any command
If a login fails after the username, this can mean PAGEDYN is too low.
Take a crash, and you will have something to analyse.
The best advice: install AMDS or Availability Manager, and you will have all the data you need to see what is going wrong.
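As a rough illustration of what "interesting" means here, the following Python sketch filters a list of process states for resource-wait (RWxxx) and MUTEX states. The process names and the input list are made up for the example; in practice the states would be read off the SHOW SYSTEM display, whose exact layout is not assumed here.

```python
def is_interesting(state: str) -> bool:
    """Return True for states that suggest a process is stuck waiting
    on a system resource (RWxxx) or on a mutex (MUTEX)."""
    s = state.strip().upper()
    return s.startswith("RW") or s == "MUTEX"


def interesting_processes(processes):
    """Keep only the (name, state) pairs whose state looks suspicious."""
    return [(name, state) for name, state in processes if is_interesting(state)]


# Hypothetical sample data, not real SHOW SYSTEM output.
sample = [
    ("SWAPPER", "HIB"),
    ("ORA_PROD1_DBW0", "RWAST"),
    ("ORA_PROD2_LGWR", "MUTEX"),
    ("TCPIP$INETACP", "HIB"),
    ("BATCH_42", "LEF"),
]

for name, state in interesting_processes(sample):
    print(f"{name:<16} {state}")
```

A handful of such processes may be transient; many of them stuck in the same state at the time of the hang would point to the resource worth investigating.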
10-17-2006 05:28 AM
Re: Problem with Cluster
If PING works but TELNET gives a timeout, could it be a process creation or scheduling problem? A high-priority looping job preventing any other processes from receiving CPU time?
Volker.
10-17-2006 07:26 AM
Re: Problem with Cluster
pagefrag
pagecrit
noslot
no pcb available
Good hunt
10-17-2006 07:46 AM
Re: Problem with Cluster
Did you consider lock tree remastering?
What are the values of the SYSGEN parameters LOCKDIRWT and PE1 on both nodes?
Regards,
Albert
10-17-2006 08:56 AM
Re: Problem with Cluster
Had a similar problem. Check your locks. HP changed some memory locking stuff in 8.2. If your locking rate has become excessive, install SYS500 and UPDATE400. There is a patch that reverts the behavior back to 7.3-2 for locking pages.
Hope that helps!
10-17-2006 07:28 PM
Re: Problem with Cluster
Did you do a simple AUTOGEN with feedback? It may show some of the parameters to adjust.
10-17-2006 08:00 PM
Re: Problem with Cluster
What is the 'typical' behavior?
Gradual slowdown until things stop, or everything normal until 'sudden death'?
So many questions have been asked already; I guess the right one is among them, but you need facts to decide which one.
I like Joseph's AUTOGEN idea. It could give a lot of info, even before you get stuck again.
Good hunting!
Proost.
Have one on me.
jpe
10-17-2006 10:44 PM
Re: Problem with Cluster
I think I found the cause of the problem. Yesterday I doubled the CHANNELCNT parameter, and everything has been working fine so far.
Some answers to your questions:
What is the 'typical' behavior?
Everything was normal until 'sudden death'.
Did you do a simple AUTOGEN with feedback?
I did. There was nothing about CHANNELCNT.
LOCKDIRWT and PE1 are set to 0 on both nodes.
I will try to schedule some downtime to increase NPAGEDYN and NPAGEVIR because of the new database instance.
Again, thanks a lot for your help and time.
10-17-2006 11:07 PM
Re: Problem with Cluster
At first glance, that would certainly appear to be able to produce the symptoms that you described.
I would also recommend checking other parameters that may be close to a problem area. It is hard to come up with a solid rule, but I would take a look at everything that is over 50-60% (since, presumably, this is to become a two-node cluster).
- Bob Gezelter, http://www.rlgsc.com
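One way to apply that rule of thumb is sketched below in Python. The resource names and numbers are invented for illustration, and the default threshold of 50% is simply the lower bound mentioned above; real figures would have to be collected from the system by hand or from monitoring data.

```python
def over_threshold(usage, threshold=0.5):
    """Return resources whose in-use fraction exceeds threshold, highest first.

    usage maps a resource name to an (in_use, limit) pair.
    """
    flagged = []
    for name, (in_use, limit) in usage.items():
        if limit <= 0:
            continue  # skip entries without a meaningful limit
        fraction = in_use / limit
        if fraction > threshold:
            flagged.append((name, round(fraction, 2)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical numbers, for illustration only.
sample_usage = {
    "channels per process": (240, 256),
    "nonpaged pool (bytes)": (30_000_000, 50_000_000),
    "process slots": (120, 400),
}

for name, fraction in over_threshold(sample_usage):
    print(f"{name}: {fraction:.0%} of limit in use")
```

Sorting by fraction puts the resource closest to exhaustion first, which is usually the one to raise before the second node takes on load.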