Operating System - OpenVMS

Re: Two node cluster, but only one at a time is up

 
Hein_vdHeuvel_d
Advisor

Re: Two node cluster, but only one at a time is up

>> Management is in the process of doing it and it takes weeks for renewal, bad timing it broke is all.

Good to hear. Too bad they did not plan ahead and make use of the prior system manager's (lifelong?) experience to cross-train, but unfortunately that is too often how it goes.

>> But I thought this is the forum to discuss and get help each other as I am interested to debug and fix.

And you did great so far, as I indicated. You found the best of the best. But there are limits to what one can convey in a forum, and the back-and-forth can take a lot of time.

A case in point is when you started to describe disks in totally amateurish terms: "quorum disk was changed from thick to thin". We were pleased to see you identified it as a quorum disk, though! That was essential.

It wouldn't surprise me if Dave Lennon identified a critical step - was that disk initialized? Or really - when the disk was replaced, what steps were taken to restore its original contents? Backup restored as per your system operations playbook? Backup restored through magic storage actions?

>> It will be discouraging for unix people to learn openvms when I see this.

I don't think so, but to each their own opinion.

Folks have been going out of their way to help get you on track and have been very responsive to a problem which originally had NOTHING pertinent to go on beyond "it doesn't work" - no error message, no (screen) output to show what led you to the conclusion it was not working, barely an identification of the bits and pieces. The expression "Like pulling teeth" comes to mind. Now that you have learned a bunch more, I encourage you to read back your original problem report and see how it really needs a mindreader to help you. Fortunately, you found one.

Good luck,

Hein.

abrsvc
Respected Contributor

Re: Two node cluster, but only one at a time is up

As others have stated, there is only so much that can be done in a forum like this. Where are you located? Even a phone consult may help resolve this. There appears to be a fundamental pathway missing here that may more easily be found via phone or a terminal session. Send a private message to us to set it up. There may be a charge, as this is how many of us make a living, but if it is important to get this system working, take advantage of the contact points here.

Dan

Volker Halle
Honored Contributor

Re: Two node cluster, but only one at a time is up

I agree with the advice given by others: you do need an experienced OpenVMS consultant to diagnose and fix this problem. Maintaining a working OpenVMS cluster does NOT need a full-time OpenVMS consultant, but in a situation like this, you need experienced help - as you've probably learned by now. Go and convince your management. Note that you could contact e.g. Dan (abrsvc) via personal mail in this forum.

Being in the same/similar timezone as 'the problem' also helps - although it gives me a lot of time to diagnose the information you've posted 'last night' and prepare some more questions to further narrow down on the problem. It also allows me more time to re-think and re-edit my reply.

Here is a refined problem description:

2 node Itanium OpenVMS V8.4 Blade SAN cluster with quorum disk - only ONE node can be started at a time, the 2nd one hangs after the following console messages:

%SYSINIT-I- found a valid OpenVMS Cluster quorum disk
%SYSINIT-I- waiting to form or join an OpenVMS Cluster
%MSCPLOAD-I-CONFIGSCAN, enabled automatic disk serving
%CNXMAN, Using local access method for quorum disk
%CNXMAN, Established "connection" to quorum disk
%CNXMAN, Have "connection" to quorum disk

Google is your friend, but you need experience in OpenVMS troubleshooting to know what to search for...

Start searching for "Have connection to quorum disk" - you'll find a couple of articles with this symptom. None of them will give you a solution, but they will help you learn about the context. This message is output by the connection manager when the node has not been able to create or join the cluster about 2 minutes after boot.

The important thing here is what's NOT shown on the console! Assuming you've literally copied ALL console output, the missing piece is a message like %CNXMAN, have connection to system XXXXXX

This message would indicate that the booting node is SEEING the 'other' node via one of the cluster communication LAN paths, in this case one of the LAN failover sets (LLc0) or a physical LAN interface. This currently does NOT seem to be the case, and that's preventing the 2nd node from joining the cluster with the other node.
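For anyone who has the console output saved to a text file, the check described above can be roughed out in a few lines of Python. This is only a sketch: the patterns are modeled on the messages quoted in this thread, the exact wording on a given OpenVMS version is an assumption, and the function names (`summarize_cnxman`, `diagnose`) are made up for illustration.

```python
import re

# Patterns modeled on the console messages quoted in this thread.
# The exact wording on a given OpenVMS version is an assumption.
QUORUM_DISK = re.compile(r'%CNXMAN,.*connection"?\s+to quorum disk', re.IGNORECASE)
REMOTE_NODE = re.compile(r'%CNXMAN,.*connection"?\s+to system\s+(\S+)', re.IGNORECASE)


def summarize_cnxman(console_text):
    """Report which CNXMAN connection messages appear in a console capture."""
    saw_quorum_disk = False
    remote_nodes = []
    for line in console_text.splitlines():
        match = REMOTE_NODE.search(line)
        if match:
            remote_nodes.append(match.group(1))
        elif QUORUM_DISK.search(line):
            saw_quorum_disk = True
    return {"quorum_disk": saw_quorum_disk, "remote_nodes": remote_nodes}


def diagnose(summary):
    """Turn the summary into a one-line hint (hypothetical helper)."""
    if summary["remote_nodes"]:
        return "sees other node(s): " + ", ".join(summary["remote_nodes"])
    if summary["quorum_disk"]:
        return "sees the quorum disk but no other cluster member; check the cluster LAN path"
    return "no CNXMAN connection messages found; check that the capture is complete"


if __name__ == "__main__":
    # Console lines copied from the problem description in this thread.
    hanging_boot = '''\
%SYSINIT-I- waiting to form or join an OpenVMS Cluster
%CNXMAN, Using local access method for quorum disk
%CNXMAN, Established "connection" to quorum disk
%CNXMAN, Have "connection" to quorum disk
'''
    print(diagnose(summarize_cnxman(hanging_boot)))
```

The script only automates the reading of the log; it cannot tell WHY the other node is not seen. That still needs someone to look at the LAN configuration.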

Please try to answer the following questions by providing detailed data:

1) what EXACTLY happened when the problem started? You described it as 'One of the nodes in the cluster was down'. Please provide the console output from BOTH systems from the time 'when that node went down' - you have now learned how to scroll the console output.

2) try a conversational boot and look at the relevant cluster system parameters of the 'hanging' node

In one of your posts, you showed: 

SYSBOOT> set STARTUP_P2 "YES"

SYSBOOT> continue

Although setting STARTUP_P2 "YES" does NOT help in this case, try to repeat whatever commands you've entered to get to the SYSBOOT> prompt (scroll back through the console log to review your commands) and issue the necessary SHOW ... commands to view the critical cluster system parameters (same syntax as with SYSGEN> prompt)

3) find successful previous boot events of both nodes in the console logs

Try to find - and save! - console messages from the most recent successful boot attempts of both nodes. Keep them as a reference and compare the contents to the current situation.

4) find the documentation of the LAN configuration for this cluster

As these seem to be Blade systems - I have no practical experience with Blades, those arrived after my 25 years at Digital/Compaq/HP - the LAN configuration may play a crucial role in this problem.

Please also think about the location of those 2 Blades. Are they in the same rack or at different sites? This information may influence further troubleshooting.
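For question 3), once you have a console capture of an older successful boot and one of the current hanging boot, the comparison could be sketched like this in Python. It is a minimal sketch, assuming both captures are plain text; the function name `missing_messages` is made up, and the sample lines are only the messages already quoted in this thread (XXXXXX is the placeholder node name used above, not a real node).

```python
def missing_messages(reference_log, current_log):
    """Return lines of the reference boot that never appear in the current boot.

    Lines are compared after stripping surrounding whitespace; order of the
    reference log is preserved and duplicates are reported once.
    """
    current = {line.strip() for line in current_log.splitlines() if line.strip()}
    missing = []
    for line in reference_log.splitlines():
        text = line.strip()
        if text and text not in current and text not in missing:
            missing.append(text)
    return missing


if __name__ == "__main__":
    # Illustrative input only, built from messages quoted in this thread.
    good_boot = '''\
%CNXMAN, Have "connection" to quorum disk
%CNXMAN, have connection to system XXXXXX
'''
    hanging_boot = '''\
%CNXMAN, Have "connection" to quorum disk
'''
    for message in missing_messages(good_boot, hanging_boot):
        print("missing in current boot:", message)
```

An exact-match comparison like this will also flag lines that differ only in a timestamp or device name, so expect to filter the result by eye.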

Regards,

Volker.

Brad McCusker
Respected Contributor

Re: Two node cluster, but only one at a time is up

Make sure you engage someone who also understands the blade enclosure and all of its pieces - this could be a problem somewhere in those links.

FWIW, our core business is managing OpenVMS systems in situations  just like yours:  System Manager retired/left and no one knows VMS.  My contact information should be available in my profile if your manager wants to engage someone to fix this problem and/or properly care for these systems long term.  And our team includes experts on the blade enclosures who have given talks on them for HPE.


I received this quote from a potential customer recently - this guy understood the situation: "Ideally we should have VMS specialists managing our systems rather than Linux and project specialists masquerading as VMS system admins on an ad-hoc basis."

Brad McCusker
Software Concepts International

VMSCheck
Advisor

Re: Two node cluster, but only one at a time is up

Well, I don't need professional help on this. I fixed it by myself. Sorry, I thought I could get some help here, but most were saying I need to get help from support. If I had support, why would I come here?

But I really thank Volker. You are the best, and thank you for supporting and encouraging people like me. Thank you again, I really appreciate it.

 

 

 

abrsvc
Respected Contributor

Re: Two node cluster, but only one at a time is up

I'm glad that you resolved the problem. I would request, however, that you post the solution here (leaving out any site-specific information) so that someone else can benefit from it in the future.

Dan

Volker Halle
Honored Contributor

Re: Two node cluster, but only one at a time is up

VMSCheck,

I'm glad I could help you solve your problem.

For the benefit of others - and also myself - could you please describe the problem and your solution?

Thanks,

Volker.