Operating System - OpenVMS

Two node cluster, but only one at a time is up

 
VMSCheck
Advisor

Re: Two node cluster, but only one at a time is up

I just tried to reset this node. The detailed log of how I did the reset follows, and it is stuck at the same place as mentioned before:


- - - - - - - - - - Prior Console Output - - - - - - - - - -
%LLC0, Logical LAN event at 20-APR-2021 19:02:02.30
%LLC0, Logical LAN failover device added to failset, EWK0
%LLC0, Logical LAN event at 20-APR-2021 19:02:02.30
%LLC0, Logical LAN failset device connected to physical device EWD0
%SYSINIT-I- found a valid OpenVMS Cluster quorum disk
%SYSINIT-I- waiting to form or join an OpenVMS Cluster
%MSCPLOAD-I-CONFIGSCAN, enabled automatic disk serving
%CNXMAN, Using local access method for quorum disk
%CNXMAN, Established "connection" to quorum disk
%CNXMAN, Have "connection" to quorum disk
- - - - - - - - - - - - Live Console - - - - - - - - - - - -


MP MAIN MENU:

CO: Console
VFP: Virtual Front Panel
CM: Command Menu
CL: Console Log
SL: Show Event Logs
HE: Main Help Menu
X: Exit Connection

[ilo-xxxx-04]</> hpiLO-> cm

 

(Use Ctrl-B to return to MP main menu.)

 

[ilo-xxxx-04] CM:hpiLO-> rs


RS

Execution of this command irrecoverably halts all system processing and
I/O activity and restarts the computer system.

Type Y to confirm your intention to restart the system: (Y/[N]) y
y
-> SPU hardware was successfully issued a reset.

 


[ilo-xxxx-04] CM:hpiLO->

MP MAIN MENU:

CO: Console
VFP: Virtual Front Panel
CM: Command Menu
CL: Console Log
SL: Show Event Logs
HE: Main Help Menu
X: Exit Connection

[ilo-xxxx-04]</> hpiLO-> co

 

[Use Ctrl-B or ESC-( to return to MP main menu.]

 

- - - - - - - - - - Prior Console Output - - - - - - - - - -
%MSCPLOAD-I-CONFIGSCAN, enabled automatic disk serving
%CNXMAN, Using local access method for quorum disk
%CNXMAN, Established "connection" to quorum disk
2,1,2,0 5404006349E10000 0000000000000000 EVN_BOOT_START
***********************************************************
* ROM Version : 01.98
* ROM Date : Fri Sep 11 00:56:00 PDT 2015
***********************************************************
2,0,2,0 3404083709E10000 000000000002000C EVN_BOOT_CELL_JOINED_PD

- - - - - - - - - - - - Live Console - - - - - - - - - - - -
2,1,2,0 340400B149E10000 000000480205000C EVN_MEM_DISCOVERY
2,0,2,0 340400B109E10000 000000080205000C EVN_MEM_DISCOVERY
2,0,2,0 Start memory test ...... 0/100
.......
2,0,2,0 Memory test progress.... 33/100
.......
2,0,2,0 Memory test progress.... 66/100
.......
2,0,2,0 Memory test progress.... 100/100
2,0,2,0 1404002609E10000 000000000006000C EVN_BOOT_CPU_LATE_TEST_START
2,0,3,0 140400260DE10000 000000000006000C EVN_BOOT_CPU_LATE_TEST_START
2,1,2,0 1404002649E10000 000000000006000C EVN_BOOT_CPU_LATE_TEST_START
2,1,3,0 140400264DE10000 000000000006000C EVN_BOOT_CPU_LATE_TEST_START
2,0,3,1 140400260FE10000 000000000006000C EVN_BOOT_CPU_LATE_TEST_START
2,1,3,1 140400264FE10000 000000000006000C EVN_BOOT_CPU_LATE_TEST_START
2,1,2,1 140400264BE10000 000000000006000C EVN_BOOT_CPU_LATE_TEST_START
2,0,2,1 140400260BE10000 000000000006000C EVN_BOOT_CPU_LATE_TEST_START
2,0,2,0 5404020709E10000 000000000011000C EVN_EFI_START

Press Ctrl-C now to bypass loading option ROM UEFI drivers.

2,0,2,0 3404008109E10000 000000000007000C EVN_IO_DISCOVERY_START
Dual Port Flex10 10GbE BL8XXc i2 Embedded CNIC is detected
Dual Port Flex10 10GbE BL8XXc i2 Embedded CNIC is detected
Dual Port Flex10 10GbE BL8XXc i2 Embedded CNIC is detected
Dual Port Flex10 10GbE BL8XXc i2 Embedded CNIC is detected
HP PCIe 2Port 8Gb Fibre Channel Adapter (driver 2.27, firmware 5.06.006)
HP PCIe 2Port 8Gb Fibre Channel Adapter (driver 2.27, firmware 5.06.006)
2,0,2,0 5404020B09E10000 0000000000000006 EVN_EFI_LAUNCH_BOOT_MANAGER
(C) Copyright 1996-2010 Hewlett-Packard Development Company, L.P.

Note, menu interfaces might only display on the primary console device.
The current primary console device is:
Serial PcieRoot(0x30304352)/Pci(0x1,0x0)/Pci(0x0,0x5)
The primary console can be changed via the 'conconfig' UEFI shell command.

Press: ENTER - Start boot entry execution
B / b - Launch Boot Manager (menu interface)
D / d - Launch Device Manager (menu interface)
M / m - Launch Boot Maintenance Manager (menu interface)
S / s - Launch UEFI Shell (command line interface)
I / i - Launch iLO Setup Tool (command line interface)

*** User input can now be provided ***

Automatic boot entry execution will start in 1 second(s).
Booting xxxx Normal Boot $1$DGA300: FGB0.2012-0002-AC00-3D42

PGQBT-I-INIT-UNIT, IPB, PCI device ID 0x2532, FW 4.04.04
PGQBT-I-BUILT, version X-33, built on Jan 16 2015 @ 12:02:52
PGQBT-I-LINK_WAIT, waiting for link to come up
PGQBT-I-TOPO_WAIT, waiting for topology ID

%RAD-I-ENABLED, RAD Support is enabled for 2 RADs


HP OpenVMS Industry Standard 64 Operating System, Version V8.4
© Copyright 1976-2019 Hewlett-Packard Development Company, L.P.


PGQBT-I-INIT-UNIT, boot driver, PCI device ID 0x2532, FW 4.04.04
PGQBT-I-BUILT, version X-33, built on Jul 19 2011 @ 16:12:20
PGQBT-I-LINK_WAIT, waiting for link to come up
PGQBT-I-TOPO_WAIT, waiting for topology ID
%DECnet-I-LOADED, network base image loaded, version = 05.17.02

%CNXMAN, Using remote access method for quorum disk
%SMP-I-CPUTRN, CPU #1 has joined the active set.
%SMP-I-CPUTRN, CPU #5 has joined the active set.
%SMP-I-CPUTRN, CPU #2 has joined the active set.
%SMP-I-CPUTRN, CPU #6 has joined the active set.
%SMP-I-CPUTRN, CPU #4 has joined the active set.
%SMP-I-CPUTRN, CPU #3 has joined the active set.
%SMP-I-CPUTRN, CPU #7 has joined the active set.
%VMScluster-I-LOADSECDB, loading
the cluster security database
%EWA0, Link up: 10 gbit, full duplex, flow control disabled
%EWE0, Function is disabled

%EWF0, Function is disabled

%EWG0, Function is disabled

%EWC0, Link up: 10 gbit, full duplex, flow control disabled
%EWH0, Function is disabled

%EWD0, Link up: 10 gbit, full duplex, flow control disabled
%EWB0, Link up: 10 gbit, full duplex, flow control disabled
%EWI0, Link up: 10 gbit, full duplex, flow control disabled
%EWM0, Function is disabled

%EWN0, Function is disabled

%EWO0, Function is disabled

%EWP0, Function is disabled

%EWK0, Link up: 10 gbit, full duplex, flow control disabled
%EWJ0, Link up: 10 gbit, full duplex, flow control disabled
%EWL0, Link up: 10 gbit, full duplex, flow control disabled
%EWA0, Jumbo frames enabled
%EWJ0, Jumbo frames enabled
%LLA0, Logical LAN event at 22-APR-2021 16:44:10.94
%LLA0, Logical LAN failset device created
%LLA0, Logical LAN event at 22-APR-2021 16:44:10.94
%LLA0, Logical LAN failover device added to failset, EWC0
%LLA0, Logical LAN event at 22-APR-2021 16:44:10.94
%LLA0, Logical LAN failover device added to failset, EWL0
%LLA0, Logical LAN event at 22-APR-2021 16:44:10.94
%LLA0, Logical LAN failset device connected to physical device EWL0
%LLB0, Logical LAN event at 22-APR-2021 16:44:10.94
%LLB0, Logical LAN failset device created
%LLB0, Logical LAN event at 22-APR-2021 16:44:10.94
%LLB0, Logical LAN failover device added to failset, EWB0
%LLB0, Logical LAN event at 22-APR-2021 16:44:10.94
%LLB0, Logical LAN failover device added to failset, EWI0
%LLB0, Logical LAN event at 22-APR-2021 16:44:10.94
%LLB0, Logical LAN failset device connected to physical device EWI0
%LLC0, Logical LAN event at 22-APR-2021 16:44:10.94
%LLC0, Logical LAN failset device created
%LLC0, Logical LAN event at 22-APR-2021 16:44:10.94
%LLC0, Logical LAN failover device added to failset, EWD0
%LLC0, Logical LAN event at 22-APR-2021 16:44:10.94
%LLC0, Logical LAN failover device added to failset, EWK0
%LLC0, Logical LAN event at 22-APR-2021 16:44:10.94
%LLC0, Logical LAN failset device connected to physical device EWD0
%SYSINIT-I- found a valid OpenVMS Cluster quorum disk
%SYSINIT-I- waiting to form or join an OpenVMS Cluster
%MSCPLOAD-I-CONFIGSCAN, enabled automatic disk serving
%CNXMAN, Using local access method for quorum disk
%CNXMAN, Established "connection" to quorum disk
%CNXMAN, Have "connection" to quorum disk

Hein_vdHeuvel_d
Advisor

Re: Two node cluster, but only one at a time is up

>> Yes, was working before. Admin who was maintaining this server left and I was asked to look into it. I am more of a UNIX person and we don't have support.

It's very gracious of Volker to try to help, but I would urge you to go back to your management and tell them this is beyond basic operations and a simple Google search. You have done a good job reaching out and finding this forum, but now the time has come to pay up.

Let them hire a consultant for good money and use the opportunity to learn about the system. They got away with it this far, so now let them pay for their sins, so to speak. They probably/possibly saved tens of thousands in maintenance and/or in hiring/training a person with the right skill set. Now it is time to pay a few thousand to a deserving consultant (not me!).

Good luck,

Hein

 

 

VMSCheck
Advisor

Re: Two node cluster, but only one at a time is up

Management is in the process of doing that, and renewal takes weeks; it just broke at a bad time. But I thought this was the forum to discuss and help each other, as I am interested in debugging and fixing it myself. Experiences like this will be discouraging for UNIX people trying to learn OpenVMS.

Dave Lennon
Advisor

Re: Two node cluster, but only one at a time is up

Hi, is it safe to say it is "hanging" right after it mentions the quorum disk? You did give it several minutes (maybe up to 5) to re-establish quorum, right?

I believe the quorum disk needs to be "VMS initialized" before it is used by the clustering software -- I know the software will re-create the QUORUM.DAT file in the top-level directory if someone accidentally deletes it (very early on in their career).

Again, not a simple fix for someone not well-versed in VMS, but I'd suggest booting one node "conversationally" or perhaps off the install DVD (neither is up and running VMS now, right?) and making sure that quorum disk unit is okay, i.e. that it can be mounted as a VMS (ODS-2 or ODS-5) volume.
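For illustration, a rough sketch of that check from a standalone-booted node might look like this (the device name $1$DGA10: is just a placeholder -- substitute your actual quorum disk unit):

$ ! Placeholder device name -- substitute your actual quorum disk unit
$ MOUNT /OVERRIDE=IDENTIFICATION $1$DGA10:   ! mount regardless of the volume label
$ DIRECTORY $1$DGA10:[000000]QUORUM.DAT      ! the file the cluster software maintains
$ ANALYZE /DISK_STRUCTURE $1$DGA10:          ! verify the ODS file structure is intact
$ DISMOUNT $1$DGA10:

If the MOUNT fails because the volume has no valid ODS structure, that by itself would explain why the clustering software cannot use it as a quorum disk.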

It should not matter if the disk unit is thin or thick provisioned on the SAN storage array, VMS really doesn't care (or know).

Did you say your site will have a position for VMS system manager? Where is it located?

VMSCheck
Advisor

Re: Two node cluster, but only one at a time is up

Yes, I waited long enough; in fact it has been stuck at that stage for a day now. I did a reset and it came back up and hung there again. Could the SAN conversion from thick to thin provisioning have changed anything at the system level to cause the issue we are seeing?

Hein_vdHeuvel_d
Advisor

Re: Two node cluster, but only one at a time is up

>> Management is in the process of doing it and it takes weeks for renewal, bad timing it broke is all.

Good to hear. Too bad they did not plan ahead and make use of the prior system manager's (lifelong?) experience to cross-train, but unfortunately that is too often how it goes.

>> But I thought this is the forum to discuss and get help each other as I am ineterested to debug and fix.

And you did great so far, as I indicated. You found the best of the best. But there are limits to what one can convey in a forum, and the back-and-forth can take a lot of time.

When you started out, you described the disks in totally amateurish terms ("quorum disk was changed from thick to thin"). We were pleased to see you identified it as a quorum disk, though! That was essential.

It wouldn't surprise me if Dave Lennon identified a critical step - was that disk initialized? Or really - when the disk was replaced, what steps were taken to restore its original contents? Backup restored as per your system operations playbook? Backup restored through magic storage actions?

>> It will be discouraging for unix people to learn openvms when I see this.

I don't think so, but to each their own opinion.

Folks have been going out of their way to help get you on track and have been very responsive to a problem which originally had NOTHING pertinent to go on beyond "it doesn't work" - no error message, no (screen) output to show what led you to the conclusion it was not working, barely an identification of the bits and pieces. The expression "like pulling teeth" comes to mind. Now that you have learned a bunch more, I encourage you to read back your original problem report and see how it really takes a mind reader to help you. Fortunately, you found one.

Good luck,

Hein.

abrsvc
Respected Contributor

Re: Two node cluster, but only one at a time is up

As others have stated, there is only so much that can be done in a forum like this. Where are you located? Even a phone consult may help resolve this. There appears to be a fundamental pathway missing here that may be found more easily via phone or a terminal session. Send a private message to us to set it up. There may be a charge, as this is how many of us make a living, but if it is important to get this system working, take advantage of the contact points here.

Dan

Volker Halle
Honored Contributor

Re: Two node cluster, but only one at a time is up

I agree with the advice given by others: you do need an experienced OpenVMS consultant to diagnose and fix this problem. Maintaining a working OpenVMS cluster does NOT need a full-time OpenVMS consultant, but in a situation like this, you need experienced help - as you've probably learned by now. Go and convince your management. Note that you could contact e.g. Dan (abrsvc) via personal mail in this forum.

Being in the same/similar timezone as 'the problem' also helps - although it gives me a lot of time to diagnose the information you've posted 'last night' and prepare some more questions to further narrow down the problem. It also allows me more time to re-think and re-edit my reply.

Here is a refined problem description:

2 node Itanium OpenVMS V8.4 Blade SAN cluster with quorum disk - only ONE node can be started at a time, the 2nd one hangs after the following console messages:

%SYSINIT-I- found a valid OpenVMS Cluster quorum disk
%SYSINIT-I- waiting to form or join an OpenVMS Cluster
%MSCPLOAD-I-CONFIGSCAN, enabled automatic disk serving
%CNXMAN, Using local access method for quorum disk
%CNXMAN, Established "connection" to quorum disk
%CNXMAN, Have "connection" to quorum disk

Google is your friend, but you need experience in OpenVMS troubleshooting to know what to search for...

Start searching for "Have connection to quorum disk" - you'll find a couple of articles with this symptom; none of them will give you a solution, but they will help you learn about the context. This message is output by the connection manager if the node cannot create or join the cluster within about 2 minutes after boot.

The important thing here is what's NOT shown on the console! Assuming you've literally copied ALL console output, the missing piece is a message like: %CNXMAN, have connection to system XXXXXX

This message would indicate that the booting node is SEEING the 'other' node via one of the cluster communication LAN paths - in this case one of the LAN failover sets (LLC0) or a physical LAN interface. This currently does NOT seem to be the case, and that is what's preventing the 2nd node from joining the cluster with the other node.
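As an illustrative check (my example, not something you have posted): on the node that IS up, the SCACP utility shows which LAN channels and virtual circuits the cluster communication software currently sees. If the 'other' node were reachable over any LAN path, it would show up here:

$ MCR SCACP
SCACP> SHOW CHANNEL    ! PEdriver channels to remote cluster nodes
SCACP> SHOW VC         ! virtual circuits formed with remote nodes
SCACP> EXIT

An empty channel list on the running node would point at the LAN paths between the two Blades rather than at the quorum disk.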

Please try to answer the following questions by providing detailed data:

1) what EXACTLY happened when the problem started - as you described it, 'One of the nodes in the cluster was down'. Please provide the console output from BOTH systems from the time 'when that node went down' - you have now learned how to scroll the console output.

2) try a conversational boot and look at the relevant cluster system parameters of the 'hanging' node

In one of your posts, you showed: 

SYSBOOT> set STARTUP_P2 "YES"

SYSBOOT> continue

Although setting STARTUP_P2 "YES" does NOT help in this case, try to repeat whatever commands you entered to get to the SYSBOOT> prompt (scroll back through the console log to review your commands) and issue the necessary SHOW ... commands to view the critical cluster system parameters (same syntax as at the SYSGEN> prompt).
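For example, a sketch of what to display (these are the standard OpenVMS cluster system parameters; compare the values on the hanging node with those on the running node):

SYSBOOT> SHOW VAXCLUSTER
SYSBOOT> SHOW VOTES
SYSBOOT> SHOW EXPECTED_VOTES
SYSBOOT> SHOW DISK_QUORUM
SYSBOOT> SHOW QDSKVOTES
SYSBOOT> SHOW NISCS_LOAD_PEA0
SYSBOOT> CONTINUE

VOTES, EXPECTED_VOTES and QDSKVOTES together determine whether one node plus the quorum disk can achieve quorum; DISK_QUORUM names the quorum disk; NISCS_LOAD_PEA0 must be 1 so the LAN cluster driver (PEdriver) gets loaded at boot.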

3) find successful previous boot events of both nodes in the console logs

Try to find - and save! - console messages from the most recent successful boot attempts of both nodes. Keep them as a reference and compare their contents to the current situation.

4) find the documentation of the LAN configuration for this cluster

As these seem to be Blade systems - I have no practical experience with Blades, those arrived after my 25 years at Digital/Compaq/HP - the LAN configuration may play a crucial role in this problem.
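If no written documentation survives, an illustrative starting point is to capture the current LAN state on the running node with LANCP and keep it with the console logs (exact output varies by configuration):

$ MCR LANCP
LANCP> SHOW CONFIGURATION
LANCP> SHOW DEVICE /CHARACTERISTICS
LANCP> EXIT

The LLA0/LLB0/LLC0 failset devices from your boot log should appear here together with their member devices (EWx0), which tells you which physical NICs each failover set depends on.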

Please also think about the location of those 2 Blades: are they in the same rack or at different sites? This information may influence further troubleshooting.

Regards,

Volker.

Brad McCusker
Respected Contributor

Re: Two node cluster, but only one at a time is up

Make sure you engage someone who also understands the blade enclosure and all of its pieces - this could be a problem somewhere in those links.

FWIW, our core business is managing OpenVMS systems in situations  just like yours:  System Manager retired/left and no one knows VMS.  My contact information should be available in my profile if your manager wants to engage someone to fix this problem and/or properly care for these systems long term.  And our team includes experts on the blade enclosures who have given talks on them for HPE.


I received this quote from a potential customer recently - this guy understood the situation: "Ideally we should have VMS specialists managing our systems rather than Linux and project specialists masquerading as VMS system admins on an ad-hoc basis."

Brad McCusker
Software Concepts International
VMSCheck
Advisor

Re: Two node cluster, but only one at a time is up

Well, I don't need professional help on this. I fixed it by myself. Sorry, I thought I could get some help here, but most were saying I need to get help from support - and if I had support, why would I come here?

But I really thank Volker. You are the best, and thank you for supporting and encouraging people like me. Thank you again; I really appreciate it.