Operating System - OpenVMS
04-06-2008 06:31 AM
Re: io autoconfigure and cluexit
This looks reversed. I think the dropped CLUEXIT hosts and the dropped DECnet connections and such all point to one host that's got something odd going on, and that this one host is getting bogged down in kernel mode and the electronic catatonia is then causing all of the faults shown so far.
Memory errors and wonky hardware (flaky disks, flaky buses, incomplete or errant hardware configurations, down-revision firmware) can sometimes do this, though errant kernel-mode software is another common trigger.
I'd look in the error logs, and I'd look in the batch job logs for that AUTOGEN job.
Look at the storage controllers, and at differences and specific configurations of all the devices. Any devices here that are incompletely connected or left unconfigured?
Do patch firmware in controllers and disks and such, patch OpenVMS, and patch anything else here with kernel-mode code to current.
Do shut off that AUTOGEN job. (If you're not seeing the load vary and you have the box already reasonably dialed in, there's probably not a big need to run it.)
Stephen Hoffman
HoffmanLabs LLC
04-06-2008 04:29 PM
Re: io autoconfigure and cluexit
Volker,
"as a first and simple workaround, try to increase RECNXINTERVAL on both nodes to exceed the time the other node seems to 'hang' during AUTOGEN."
I changed it to 180 seconds.
"You can calculate the time from the start of the AUTOGEN command (from batch .LOG file) until the crash time of the other node (from the crash). This should at least allow the node to survive a temporary loss of connection."
The batch job log shows nothing useful beyond the run time: the job now takes around 5 minutes to complete, whereas older logs show it took around 3 minutes.
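For anyone wanting to do the calculation Volker describes, here is a minimal sketch in Python. It assumes you have copied the AUTOGEN start time from the batch .LOG file and the crash time of the other node by hand. The function name `suggested_recnxinterval`, the 30-second safety margin, and the sample timestamps are all hypothetical and not from this thread. It also assumes English month abbreviations and times without hundredths of a second.

```python
from datetime import datetime
from math import ceil

# OpenVMS-style absolute time without hundredths, e.g. "6-APR-2008 02:00:00".
# Assumption: English month abbreviations (the default C/English locale).
VMS_TIME_FORMAT = "%d-%b-%Y %H:%M:%S"


def suggested_recnxinterval(autogen_start: str, crash_time: str,
                            margin_seconds: int = 30) -> int:
    """Return a candidate RECNXINTERVAL in seconds.

    The value is the time from the start of AUTOGEN until the other
    node crashed, plus a safety margin. The margin is an illustrative
    assumption, not a recommended figure.
    """
    start = datetime.strptime(autogen_start, VMS_TIME_FORMAT)
    crash = datetime.strptime(crash_time, VMS_TIME_FORMAT)
    elapsed = (crash - start).total_seconds()
    if elapsed < 0:
        raise ValueError("crash time is earlier than the AUTOGEN start time")
    # Round up to whole seconds before adding the margin.
    return ceil(elapsed) + margin_seconds


if __name__ == "__main__":
    # Hypothetical timestamps, purely for illustration.
    print(suggested_recnxinterval("6-APR-2008 02:00:00", "6-APR-2008 02:02:25"))
```

Treat the result as a starting point only. As the quote says, the value has to be raised on both nodes, and it only helps the cluster ride out a temporary loss of connection; it does not fix whatever is causing the hang.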
A solution seems to have been found. I have replied to Hoff regarding this.
Regards
Mark
04-06-2008 04:35 PM
Re: io autoconfigure and cluexit
Hoff,
"This looks reversed. I think the dropped CLUEXIT hosts and the dropped DECnet connections and such all point to one host that's got something odd going on, and that this one host is getting bogged down in kernel mode and the electronic catatonia is then causing all of the faults shown so far."
Yes, quite correct. One server, and only one, causes the other member of the cluster to CLUEXIT when IO AUTOCONFIGURE is run during AUTOGEN's GETDATA phase.
"Memory errors and wonky hardware (flaky disks, flaky buses, incomplete or errant hardware configurations, down-revision firmware) can sometimes do this, though errant kernel-mode software another common trigger."
It seems a "wonky" Compaq 5300 RAID controller was the cause. Once it was replaced, the problem disappeared.
"Do shut off that AUTOGEN job. (If you're not seeing the load vary and you have the box already reasonably dialed in, there's probably not a big need to run it.)"
I agree; I will do that immediately. But, hey, without it we might not have found the wonky card until something more serious had occurred.
Regards,
Mark
04-06-2008 04:42 PM
Re: io autoconfigure and cluexit
Firstly, thank you to Jur, Kalle, Volker and Hoff.
The main problem stemmed from a RECNXINTERVAL setting that was too low to cope with extended "hanging" on the other server.
The hanging itself was the result of bad hardware causing SYSMAN's IO AUTOCONFIGURE to become stuck for a lengthy period.
The solution was to replace the bad card and increase the RECNXINTERVAL value to 180.
I never realised that AUTOGEN's GETDATA phase runs SYSMAN's IO AUTOCONFIGURE.
Thanks once again to all those who put their time into helping.
Regards,
Mark.
The opinions expressed above are the personal opinions of the authors, not of Hewlett Packard Enterprise. By using this site, you accept the Terms of Use and Rules of Participation.
© Copyright 2024 Hewlett Packard Enterprise Development LP