io autoconfigure and cluexit
04-03-2008 04:36 PM
We have a two-node cluster with both nodes on V7.3-2; one is patched to update 15, the other to update 13.
If, at any time, I run sysman io autoconfigure on one particular server, the other server cluexits. If I run it on the latter, it has no effect on the former. See attached for the cluexit info.
Has anyone ever seen or heard of this happening?
We have eliminated power, heat and network outages as the possible causes, albeit from the non-VMS side of things.
VMS gives out on the console:
%CNXMAN, Quorum lost, blocking activity
It then removes the other server's disks from the shadowset and does the following:
**** OpenVMS (TM) Alpha Operating System V7.3-2 - BUGCHECK ****
** Bugcheck code = 000005DC: CLUEXIT, Node voluntarily exiting VMScluster
** Crash CPU: 00 Primary CPU: 00 Active CPUs: 00000001
** Current Process = NULL
** Current PSB ID = 00000001
** Image Name =
**** Starting compressed selective memory dump at 1-APR-2008 01:36...
............................................................................
.....................................................
** System space, key processes, and key global pages have been dumped.
** Now dumping remaining processes and global pages...
.....................
.Complete ****
Then it reboots and we continue on.
Regards,
Mark
04-03-2008 09:15 PM
04-03-2008 11:51 PM
Re: io autoconfigure and cluexit
Jur.
04-04-2008 01:16 PM
Re: io autoconfigure and cluexit
" Just a wild guess, perhaps the AUTOCONFIGURE takes too long on the system and prevents it from sending the cluster hello messages. What is the value of RECNXINTERVAL and perhaps those other *INTERVAL parameters?"
I have read a previous discourse: http://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=1065647
which covers what you mention. Our RECNXINTERVAL is set to 20 seconds.
We have 2 absolutely independent means of cluster connectivity, so it was assumed (perhaps wrongly) that neither of these could fail under normal circumstances at the same time and/or for that period of time (20 seconds).
What other "*INTERVAL" parameters are you referring to? MSCP stuff?
Regards,
Mark.
04-04-2008 01:18 PM
Re: io autoconfigure and cluexit
" There can be something in sys$manager:syconfig.com that may be of influence. This is typically caused by a device driver staying way too long at elevated IPL."
I'm confused. What does syconfig.com have to do with sysman's io autoconfigure?
Regards,
Mark.
04-04-2008 01:20 PM
Re: io autoconfigure and cluexit
Over the last 6 days, at 2-day intervals and at approximately 1:10am each time, this same system has CLUEXITed.
Coincidentally, the other server begins its tape backup at this time.
04-04-2008 01:22 PM
Re: io autoconfigure and cluexit
04-04-2008 01:29 PM
Re: io autoconfigure and cluexit
Apologies, I was for some reason thinking of sylogicals, not syconfig. Sorry.
There is nothing in syconfig.
Regards,
Mark
04-04-2008 02:03 PM
Re: io autoconfigure and cluexit
Every second day, Tuesday, Thursday & Saturday at 1am, it runs the following command:
@sys$update:autogen GETDATA TESTFILES feedback
When I run it manually over a DECnet connection from another machine:
SRS$USER:MARK> @sys$update:autogen GETDATA TESTFILES feedback
%AUTOGEN-I-BEGIN, GETDATA phase is beginning.
Running SYCONFIG.COM
End of SYCONFIG.COM
%REM-F-NETERR, DECnet channel error on remote terminal link
%REM-S-END, control returned to node LOCAL:.WOMBAT::
%SYSTEM-F-PATHLOST, path to network partner node lost
The connection froze, then dropped out. This would then, I presume, cause RECNXINTERVAL to be exceeded, because the wait is far more than 20 seconds.
So it seems that something is askew on one server which is making even autogen lock it up.
04-06-2008 02:00 AM
Re: io autoconfigure and cluexit
As a first and simple workaround, try increasing RECNXINTERVAL on both nodes so that it exceeds the time the other node seems to 'hang' during AUTOGEN. You can calculate that time from the start of the AUTOGEN command (from the batch .LOG file) until the crash time of the other node (from the crash). This should at least allow the node to survive a temporary loss of connection.
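The subtraction described above can be sketched in Python, as one possible way to size the parameter. This is only an illustration: the timestamps, the helper names and the 1.5 safety factor are hypothetical, and the timestamp layout is an assumption. Only the idea of measuring from the AUTOGEN start to the crash time comes from this thread.

```python
import math
from datetime import datetime

# Assumed timestamp layout, e.g. "1-APR-2008 01:36:00".
VMS_TIME_FORMAT = "%d-%b-%Y %H:%M:%S"

def hang_seconds(autogen_start, crash_time):
    """Seconds between the AUTOGEN start (batch log) and the other node's crash."""
    start = datetime.strptime(autogen_start, VMS_TIME_FORMAT)
    crash = datetime.strptime(crash_time, VMS_TIME_FORMAT)
    return int((crash - start).total_seconds())

def suggested_recnxinterval(hang, safety_factor=1.5):
    """Pad the observed hang; the safety factor is an arbitrary illustration."""
    return math.ceil(hang * safety_factor)

# Hypothetical timestamps, not taken from the actual logs.
hang = hang_seconds("1-APR-2008 01:34:00", "1-APR-2008 01:36:00")
print(hang, suggested_recnxinterval(hang))
```

The result is a starting point for the RECNXINTERVAL value, not a recommendation; the underlying hang still needs to be diagnosed.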
RECNXINTERVAL is a dynamic parameter, so you could even just increase it (cluster-wide) from the batch job running your semi-daily AUTOGEN procedure:
$ MC SYSMAN
SYSMAN> SET ENV/CLUS
SYSMAN> PARA USE ACTIVE
SYSMAN> PARA SET RECNXINTERVAL 180
SYSMAN> PARA WRITE ACTIVE
SYSMAN> EXIT
A node will consider connectivity to be lost when it has not received a cluster hello multicast message from a given other node in the cluster for more than about 9 seconds. It will time out the other node and remove it from its view of the cluster if it has not received another hello message within RECNXINTERVAL seconds. If the node has been removed and the next hello message is then received, the nodes in the cluster have to determine which subset of nodes may survive. If the local node is not part of that subset, it will crash with a CLUEXIT and re-join the cluster during reboot.
To find out what's happening during the AUTOGEN run, you may want to run PC tracing (see SDA> PCS) and find out which code is running at high IPL for an extended amount of time. Or crash that node while AUTOGEN is running, after the other node has reported 'Lost connection'.
Volker.
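The timeout behaviour described in the post above can be modelled roughly as follows. This is a simplified sketch and not how the connection manager is actually implemented: the function name is hypothetical, the 9-second figure is the approximate value quoted above, and measuring both thresholds from the last hello message received is an assumption.

```python
HELLO_TIMEOUT_S = 9  # approximate figure quoted in the post above

def classify_silence(silence_s, recnxinterval_s, hello_timeout_s=HELLO_TIMEOUT_S):
    """Classify how a node views a peer that has been silent for silence_s seconds.

    Assumption: both thresholds are measured from the last hello received.
    """
    if silence_s <= hello_timeout_s:
        return "connected"
    if silence_s <= recnxinterval_s:
        return "connection lost, waiting"
    return "removed from cluster"

# A hypothetical 60-second hang against the two settings mentioned in this thread.
print(classify_silence(60, 20))
print(classify_silence(60, 180))
```

Under these assumptions, a hang that outlasts RECNXINTERVAL gets the silent node removed, which is the precondition for the CLUEXIT when it resumes sending hello messages.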
04-06-2008 06:31 AM
Re: io autoconfigure and cluexit
Memory errors and wonky hardware (flaky disks, flaky buses, incomplete or errant hardware configurations, down-revision firmware) can sometimes do this, though errant kernel-mode software is another common trigger.
I'd look in the error logs, and I'd look in the batch job logs for that AUTOGEN job.
Look at the storage controllers, and at differences and specific configurations of all the devices. Any devices here that are incompletely connected or left unconfigured?
Do patch firmware in controllers and disks and such, patch OpenVMS, and patch anything else here with kernel-mode code to current.
Do shut off that AUTOGEN job. (If you're not seeing the load vary and you have the box already reasonably dialed in, there's probably not a big need to run it.)
Stephen Hoffman
HoffmanLabs LLC
04-06-2008 04:29 PM
Re: io autoconfigure and cluexit
" as a first and simple workaround, try to increase RECNXINTERVAL on both nodes to exceed the time the other node seems to 'hang' during AUTOGEN."
I changed it to 180 seconds.
" You can calculate the time from the start of the AUTOGEN command (from batch .LOG file) until the crash time of the other node (from the crash). This should at least allow the node to survive a temporary loss of connection. "
The batch job log, while not showing anything useful, shows it is taking around 5 minutes to complete. Older logs showed it takes around 3 minutes.
A solution seems to have been found. I have replied to Hoff regarding this.
Regards
Mark
04-06-2008 04:35 PM
Re: io autoconfigure and cluexit
" This looks reversed. I think the dropped CLUEXIT hosts and the dropped DECnet connections and such all point to one host that's got something odd going on, and that this one host is getting bogged down in kernel mode and the electronic catatonia is then causing all of the faults shown so far. "
Yes, quite correct: one server, and only one, causes the other member of the cluster to cluexit when io autoconfigure is run on it during autogen's getdata phase.
" Memory errors and wonky hardware (flaky disks, flaky buses, incomplete or errant hardware configurations, down-revision firmware) can sometimes do this, though errant kernel-mode software another common trigger."
It seems it was a "wonky" Compaq 5300 raid controller that was the cause. Once this was replaced, the problem disappeared.
" Do shut off that AUTOGEN job. (If you're not seeing the load vary and you have the box already reasonably dialed in, there's probably not a big need to run it.)"
I agree; I will do that immediately. But, hey, without it we might not have found the wonky card until something more serious had occurred.
Regards,
Mark
04-06-2008 04:42 PM
Re: io autoconfigure and cluexit
The main problem stems from an inadequate RECNXINTERVAL setting which cannot cope with extended "hanging" on the other server.
As a corollary to this, the hanging was the result of bad hardware causing sysman's io autoconfigure to become stuck for a lengthy period.
The solution was to remove/replace the bad card and increase the RECNXINTERVAL value to 180.
I never realised that autogen getdata runs sysman's io autoconfigure.
Thanks once again to all those who put their time into helping.
Regards,
Mark.