broken cluster after autogen

 
Markus Waldorf_1
Regular Advisor

I have set up a cluster using two DS20e systems, an RA3000 on a shared SCSI bus, and Memory Channel. Everything was working OK until I decided to add some system parameters. I was also trying to consolidate the parameters and include a cluster-wide modparams file.

After running "autogen savparams setparams feedback" I rebooted the first node in the cluster using "reboot". Now "show cluster" on the second node shows node NW1 as BRK_NON, and it does not come up anymore. I will have to go to work today to check the console.

Below are the parameter files for each node. Compared with the previous working parameters, what I changed on node NW1 is NISCS_LOAD_PEA0 from 0 to 1 and INTERCONNECT from "MC" to "NICIMC", but is this the reason it does not come up anymore?

Does someone know what could be wrong, please, and how to correct the situation?


node NW1 sys$system:modparams.dat

VOTES=1
SCSNODE="NW1"
SCSSYSTEMID=1025
AGEN$INCLUDE_PARAMS SYS$MANAGER:AGEN$NEW_NODE_DEFAULTS.DAT
AGEN$INCLUDE_PARAMS SYS$COMMON:[SYSEXE]CLUSTER$MODPARAMS.DAT


node NW2 sys$system:modparams.dat

SCSNODE="NW2"
SCSSYSTEMID=1026
AGEN$INCLUDE_PARAMS SYS$MANAGER:AGEN$NEW_NODE_DEFAULTS.DAT
AGEN$INCLUDE_PARAMS SYS$COMMON:[SYSEXE]CLUSTER$MODPARAMS.DAT


SYS$COMMON:[SYSEXE]CLUSTER$MODPARAMS.DAT

!*** NW app
!**
MIN_GBLPAGES = 140000
MIN_MAXBUF = 8192
MIN_PQL_MWSQUOTA = 4000
MIN_PQL_MWSEXTENT = 6000
MIN_PQL_MASTLM = 100
MIN_PQL_MBIOLM = 100
MIN_PQL_MBYTLM = 100000
MIN_PQL_MDIOLM = 100
MIN_PQL_MENQLM = 300
MIN_PQL_MFILLM = 100
MIN_PQL_MPRCLM = 10
!
!*** Cluster
!*
NISCS_LOAD_PEA0=1
! Loads the port driver PEDRIVER. Set NISCS_LOAD_PEA0=0
! to disable the LAN for cluster communication
MC_SERVICES_P2=1
! Sets MC_SERVICES_P2 to 1 to load the PMDRIVER (PMA0) cluster driver.
! This system parameter enables MEMORY CHANNEL on the local
! computer for node-to-node cluster communications.
VAXCLUSTER=2
! VAXCLUSTER system parameter must be set to 2 if the NISCS_LOAD_PEA0
! parameter is set to 1. This ensures coordinated access to shared
! resources in the cluster and prevents accidental data corruption.
DISK_QUORUM="$2$DKA2"
QDSKVOTES=1
ALLOCLASS=1
INTERCONNECT="NICIMC"
BOOTNODE="NO"
DEVICE_NAMING=1
!
!*** Swap, Page, Dump
!*
SWAPFILE = 0
PAGEFILE = 0
DUMPFILE = 14
! Dump is located in NW1/2$DKB500:[sys0/1.sysexe]sysdump.dmp
! Disk Defined by console var.
!
!*** The following entries have been added by the RMS remedial kit
!*
ADD_PIOPAGES = 100 !***Entered by RMS (ASB increase)
ADD_IMGIOCNT = 100 !***Entered by RMS (ASB increase)
MIN_PIOPAGES = 675 !***Entered by RMS (ASB increase)
MIN_IMGIOCNT = 228 !***Entered by RMS (ASB increase)
!
!*** DEGPA-TA Gigabit Ethernet
!*
ADD_NPAGEVIR = 1500000
!
!*** Volume Shadowing - just in case
!*
SHADOW_SYS_DISK = 1
SHADOW_SYS_UNIT = 0
SHADOWING = 2
SHADOW_MAX_COPY = 2
!
!*** No special purpose - just in case
!*
ADD_NPAGEDYN = 600000
ADD_GBLPAGES = 350000
MIN_GBLSECTIONS = 3000
LOCKIDTBL = 4096
!
!*** Create a startup log file
!*
! /OUTPUT=FILE,CONSOLE (default)
! Sends output generated by using the /VERIFY qualifier to a file or to the
! system console. If you choose the FILE option, it creates
! SYS$SPECIFIC:[SYSEXE]STARTUP.LOG.
!
! /NOCHECKPOINTING (default)
! /CHECKPOINTING
! Displays information messages describing the time and status of each
! startup phase and component procedure.
!
! /VERIFY=FULL (default),PARTIAL
! /NOVERIFY
! FULL Displays every line of DCL executed by startup component
! procedures and by STARTUP.COM
! PARTIAL Displays every line of DCL executed by startup component
! procedures, but does not display DCL executed by STARTUP.COM
!
! SYSMAN SET STARTUP OPTIONS/OUTPUT=FILE/VERIFY=FULL
! STARTUP_P2 LETTER: D F
! SYSMAN SET STARTUP OPTIONS/OUTPUT=FILE/VERIFY=PARTIAL/CHECKPOINT
! STARTUP_P2 LETTER: D P C
!
STARTUP_P2 = "DCP"
!
12 REPLIES
Jeremy Begg
Trusted Contributor
Solution

Re: broken cluster after autogen

Markus,

I don't see anything obviously broken but it would help to know what SYSGEN parameters actually changed when you ran AUTOGEN. Whenever I run AUTOGEN I do this:

$ diff/par sys$system:setparams.dat

to see what will change when the system gets rebooted.

Your CLUSTER$MODPARAMS.DAT enables volume shadowing. Was it enabled on NW1 previously?

Regards,
Jeremy Begg
Markus Waldorf_1
Regular Advisor

Re: broken cluster after autogen

Thanks for the reply. I don't use volume shadowing, but we have a license for it, and since there is only a standard SCSI controller for the internal drive bay... I certainly don't use it for the quorum, and probably never will for the system disk. I put it in so I'm prepared. Comparing the setparams.dat files is a good tip. I notice a few changes, in particular EXPECTED_VOTES. How did this change? Anyway, I'm going to work now to see what the console shows...

SYSTEM@NW2>diff/par SETPARAMS.DAT;8 SETPARAMS.DAT;2
-----------------------------------------------------------------------------------------------------------------------------------
File DISK$ALPHASYS:[SYS0.SYSEXE]SETPARAMS.DAT;8 | File DISK$ALPHASYS:[SYS0.SYSEXE]SETPARAMS.DAT;2
-------------------------------- 5 --------------------------------------------------------------- 5 ------------------------------
set GBLSECTIONS 3000 | set GBLSECTIONS 600
set GBLPAGES 256525 | set GBLPAGES 150000
set GBLPAGFIL 1024 | set GBLPAGFIL 1024
set MAXPROCESSCNT 668 | set MAXPROCESSCNT 835
-------------------------------- 13 -------------------------------------------------------------- 13 -----------------------------
set SWPFILCNT 3 | set SWPFILCNT 2
set SYSMWCNT 10933 | set SYSMWCNT 11754
set BALSETCNT 666 | set BALSETCNT 833
set WSMAX 1091584 | set WSMAX 1091584
set NPAGEDYN 5305312 | set NPAGEDYN 5267456
set NPAGEVIR 22721248 | set NPAGEVIR 26337280
set PAGEDYN 6881280 | set PAGEDYN 7512064
-------------------------------- 23 -------------------------------------------------------------- 23 -----------------------------
set MPW_LOLIMIT 1998 | set MPW_LOLIMIT 2499
set MPW_IOLIMIT 4 | set MPW_IOLIMIT 4
set MPW_THRESH 3996 | set MPW_THRESH 4998
-------------------------------- 36 -------------------------------------------------------------- 36 -----------------------------
set FREELIM 686 | set FREELIM 853
set FREEGOAL 2664 | set FREEGOAL 3332
set GROWLIM 686 | set GROWLIM 853
set BORROWLIM 686 | set BORROWLIM 853
set CLISYMTBL 512 | set CLISYMTBL 512
set LOCKIDTBL 4096 | set LOCKIDTBL 1792
set RESHASHTBL 8192 | set RESHASHTBL 2048
set SCSBUFFCNT 512 | set SCSBUFFCNT 50
set SCSCONNCNT 10 | set SCSCONNCNT 5
-------------------------------- 75 -------------------------------------------------------------- 75 -----------------------------
set ACP_HDRCACHE 1999 | set ACP_HDRCACHE 1666
set ACP_DIRCACHE 1666 | set ACP_DIRCACHE 1666
set ACP_DINDXCACHE 416 | set ACP_DINDXCACHE 416
set ACP_QUOCACHE 668 | set ACP_QUOCACHE 835
set ACP_SYSACC 66 | set ACP_SYSACC 69
set ACP_SWAPFLGS 14 | set ACP_SWAPFLGS 14
set VAXCLUSTER 2 | set VAXCLUSTER 2
set EXPECTED_VOTES 3 | set EXPECTED_VOTES 1
-------------------------------- 87 -------------------------------------------------------------- 87 -----------------------------
set LOCKDIRWT 1 | set LOCKDIRWT 0
set NISCS_LOAD_PEA0 1 |
-------------------------------- 93 -------------------------------------------------------------- 92 -----------------------------
set STARTUP_P2 "DCP" |
-------------------------------- 98 -------------------------------------------------------------- 96 -----------------------------
set SHADOWING 2 | set SHADOW_MAX_COPY 1
set SHADOW_SYS_DISK 1 | set ZERO_LIST_HI 8192
set SHADOW_MAX_COPY 2 | set GH_EXEC_CODE 512
set ZERO_LIST_HI 8192 | set GH_EXEC_DATA 128
set GH_EXEC_CODE 512 |
set GH_EXEC_DATA 192 |
-------------------------------- 113 ------------------------------------------------------------- 109 ----------------------------
set PIOPAGES 675 | set PIOPAGES 575
set CTLPAGES 256 | set CTLPAGES 256
set IMGIOCNT 228 |
-----------------------------------------------------------------------------------------------------------------------------------

Number of difference sections found: 9
Number of difference records found: 43

DIFFERENCES /IGNORE=()/PARALLEL-
DISK$ALPHASYS:[SYS0.SYSEXE]SETPARAMS.DAT;8-
DISK$ALPHASYS:[SYS0.SYSEXE]SETPARAMS.DAT;2
SYSTEM@NW2>
Robert Gezelter
Honored Contributor

Re: broken cluster after autogen

Markus,

Possibly SHADOW_SYS_DISK? Turning shadowing on is not a problem, but SHADOW_SYS_DISK?

- Bob Gezelter, http://www.rlgsc.com
Markus Waldorf_1
Regular Advisor

Re: broken cluster after autogen

Hi,

The console showed a bugcheck when trying to create a shadowed system disk. Silly me, I remember now that I ran into the same problem 10 years ago. It also showed "%PEA0, port transition failure" and a failure to find the VMScluster security database, but that was not the show stopper. I disabled shadowing of the system disk and loading of the PEA driver. I'm not going to use anything other than Memory Channel anyway.

P00>> b -fl 0,1
SYSBOOT>> use current
SYSBOOT>> show/all (F1 to stop screen)
SYSBOOT>> set shadow_sys_disk 0
SYSBOOT>> cont

... after that all was OK. I modified my cluster$modparams.dat to:
SHADOW_SYS_DISK= 0
NISCS_LOAD_PEA0=0
INTERCONNECT="MC"

another
$ @sys$update:autogen savparams setparams feedback
$ reboot
.. and everything is in the clear again

Thanks,
Markus

Volker Halle
Honored Contributor

Re: broken cluster after autogen

Markus,

If you're using a common system disk, or if the system disk of NW1 is mounted from node NW2, then NW1 cannot mount its system disk as a shadowed system disk (requested by SHADOW_SYS_DISK=1) during boot and will hang or crash.

EXPECTED_VOTES=3 is expected, when both nodes have VOTES=1 and you're using a quorum disk with QDSKVOTES=1.
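
The arithmetic behind that statement can be sketched in a few lines of Python. This is only an illustration of the commonly documented estimate, quorum = (EXPECTED_VOTES + 2) / 2 with the fraction dropped; the function names are made up for this example, and a running cluster may adjust its quorum value dynamically.

```python
def quorum(expected_votes: int) -> int:
    """Estimated quorum: (EXPECTED_VOTES + 2) / 2, truncated to an integer."""
    return (expected_votes + 2) // 2


def keeps_quorum(remaining_votes: int, expected_votes: int) -> bool:
    """True if the surviving voters still reach the estimated quorum."""
    return remaining_votes >= quorum(expected_votes)


# Votes as described in this thread: each node has VOTES=1 and the
# quorum disk contributes QDSKVOTES=1.
votes = {"NW1": 1, "NW2": 1, "quorum_disk": 1}
expected_votes = sum(votes.values())

print("EXPECTED_VOTES:", expected_votes)
print("quorum:", quorum(expected_votes))

# One node down, the other node plus the quorum disk remain.
print("one node + quorum disk keeps quorum:",
      keeps_quorum(votes["NW2"] + votes["quorum_disk"], expected_votes))

# A lone node without the quorum disk.
print("single node alone keeps quorum:",
      keeps_quorum(votes["NW2"], expected_votes))
```

With three total votes the estimated quorum is 2, so either node together with the quorum disk can keep running, while a single node on its own cannot.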

Volker.
Robert Gezelter
Honored Contributor

Re: broken cluster after autogen

Markus,

You might want to put the PE support back in and check the hardware connection data path. If both nodes have support for the PE cluster communications, it can serve as an alternate path.

Note that BOTH nodes require the support. I am not surprised that one node not finding anyone else on the cable would generate an error message. Then again, so would a disconnected cable.

- Bob Gezelter, http://www.rlgsc.com
comarow
Trusted Contributor

Re: broken cluster after autogen

The first thing I must say.
Autogen did not break your system.
It was that sysgen changes were made to the
system and not documented or completed in modparams.dat. Then autogen did what it was supposed to.

If you're making dramatic changes to the system, I would suggest that if you don't have a way to get to the console, which could be as simple as a terminal server line going into the serial port, you had better be there to see what is going on. If this is a play cluster, then please excuse my
somewhat harsh tone.

People constantly blame Autogen saying it broke the system. It's a pet peeve of mine.
Though there have been a few cases where autogen has problems, they are exceptions.

You have device naming set to one, so I have no idea what the port allocation class of the system disk is. It is possible that the port allocation class is zero but unlikely.

When you boot, at what point does it hang?
Once again, you need to be there.

Or perhaps there's an operator or night watchman that can serve as your eyes?

In fact, are you sure you are even booting
into the right root? You probably are but
without seeing it, that would be a possible
problem.

You need to get there if your system is important and your job is to make it run.

At sysboot> you can do a show/cluster
on both nodes or at sysgen.

Summary, from a distance I'd start with
dif/par setparams.dat but you'll still need to get to the console.

But if it were my system I'd already be there. There was one tool at my old site called VAX Cluster Console, which unfortunately is owned by CA. But it lets you
set up a way to get to all your consoles remotely. It was once a DEC product.

A comment. You have so much information that the leaves hide in the trees.

When during boot you put out too much information, it's too easy to lose the important information. At my last site
they sent hundreds of useless emails every time the system did anything, that it was like a shotgun and if you missed a pellet
folks got upset. Really a bad practice.
It should send information when something doesn't work.

That's why things like Availability Manager turn red when something goes wrong.



dif/par setparams.dat will make it easy
to see what changed when you ran autogen.

By the way, since you have niscs_load_pea0 set,
if you have a working network, it will cluster over the network even if the memory
channel is not working.

You have lots of ADD_ parameters. This creates voodoo, as we really don't know what the resulting value is. You are far more likely to know your parameters if you replace them all with MIN_PARAMETER=X
so you know at least what it is.
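
As a rough aid for that cleanup, here is a small Python sketch that lists the ADD_ entries in a modparams-style file and flags any name assigned more than once. It is not part of AUTOGEN; the function name and the sample text are made up for this illustration, and the parsing assumes the simple NAME = value ! comment layout used in the files posted above.

```python
import re
from collections import Counter

# A parameter assignment starts with a name followed by "=".
ASSIGN_RE = re.compile(r"^\s*([A-Za-z_$][A-Za-z0-9_$]*)\s*=")


def scan_modparams(text):
    """Return (ADD_ names in file order, sorted names assigned more than once)."""
    names = []
    for line in text.splitlines():
        code = line.split("!", 1)[0]  # drop everything after a "!" comment marker
        match = ASSIGN_RE.match(code)
        if match:
            names.append(match.group(1).upper())
    add_params = [n for n in names if n.startswith("ADD_")]
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
    return add_params, duplicates


# Sample lines copied from the CLUSTER$MODPARAMS.DAT shown earlier in the thread.
sample = """\
MIN_GBLPAGES = 140000
NISCS_LOAD_PEA0=1
! Loads the port driver PEDRIVER. Set NISCS_LOAD_PEA0=0
ADD_PIOPAGES = 100 !***Entered by RMS (ASB increase)
ADD_NPAGEVIR = 1500000
ADD_NPAGEDYN = 600000
ADD_GBLPAGES = 350000
"""

add_params, duplicates = scan_modparams(sample)
print("ADD_ entries:", add_params)
print("multiply defined:", duplicates)
```

The list of ADD_ entries is a starting point for deciding which ones to replace with explicit MIN_ values.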

Also, Agen$feedback.report often provides
useful information on the autogen, including
parameters multiply defined, misspelled,
information on your feedback, and changes it makes, changes it wants to make. That's something you can get to remotely.

I don't know what's in the included AGEN$NEW_NODE_DEFAULTS.DAT. I have this feeling that it's hung trying to join the cluster with information from the other node. Just a hunch.

Have fun.





Because you set some parameters that worked and they were not the same in modparams.dat,
that is what broke your system.
Markus Waldorf_1
Regular Advisor

Re: broken cluster after autogen

I'm aware that it was not autogen as such that caused the problem, but running autogen after I made changes to modparams.dat ;-)

Regarding the PE driver, well I tried to enable it again:
INTERCONNECT="NICIMC"
NISCS_LOAD_PEA0=1
Autogen, restart.

both nodes started up but showed error at the console:
VMScluster security database not found,
%PEA0, port transition failure

What port/cable is it using? Both computers are connected and I can do a "set host to each other". I put the old settings back for now, disable the PEA driver and "MC" for interconnect.

I also figured that I had a wrong setting:
DUMPFILE=14, should be DUMPSTYLE=14

And added DUMPFILE=0, and EXPECTED_VOTES=3

Thanks,
Markus
Markus Waldorf_1
Regular Advisor

Re: broken cluster after autogen

Btw,
> EXPECTED_VOTES=3 is expected, when both nodes have VOTES=1 and you're using a quorum disk with QDSKVOTES=1.

I'm using EXPECTED_VOTES=3 on both nodes, but VOTES=1 only on one of the two members. I was reading that only one should have VOTES=1.
Hoff
Honored Contributor

Re: broken cluster after autogen

> I'm using EXPECTED_VOTES=3 on both nodes, but VOTES=1 only on 1 of the 2 members. I was reading that only one should have VOTES=1

URL or reference, please?

Save for very specific circumstances when you might "play games" or "get creative" with the system parameter settings of votes and expected_votes to get a particular configuration to initially boot or such, whatever you were reading was either confusing, mis-worded, or was telling you how to corrupt your disks.

For a two-node cluster with no shared storage, read this:

http://labs.hoffmanlabs.com/node/967

For more general information, read this:

http://labs.hoffmanlabs.com/node/153
comarow
Trusted Contributor

Re: broken cluster after autogen

Congratulations.

That must feel great. In 10 years of tech support for VMS, I heard so many people blame autogen.

Now that you found it, check what all the add_params end up set to and change them to
min_params=actual value.

Did you find it with dif/par setparams.dat?
I always did this after running autogen,
especially to make sure it didn't change any cluster or shadow parameters.

I don't know your site but having a pair of eyes there at night you can depend on can
make your life easier.

So, it was hanging ready to form or join a cluster.

Is your quorum disk seen by each system
at the boot prompt?

I really think simplifying your modparams.dat
will greatly help diagnosing in the future.

If you had reviewed agen$feedback.dat, the change would have been there, along with so much other information that you would miss it. Dif/par would have made the change more obvious.

And now, joy of joys, you know you have an autogenable system.

Have fun.

System management, 90% simple stuff,
9% interesting and learning stuff, and
1% utter panic. Somehow we forget the
98% skill and easy stuff, and focus on
the 1% of utter panic!

Please don't forget to add points.

I miss this, I worked tech support for 10 years and two of my specialities were clustering and boot, and your call was both.



Markus Waldorf_1
Regular Advisor

Re: broken cluster after autogen

Sorry, I cannot find the reference anymore, but it does not matter. I know your information is correct.

I checked with "mcr sysgen - use current - show votes", and both the current and the default value for VOTES are 1. I guess that is the reason why it works, even though I do not have VOTES=1 specified on one of the nodes. I will correct it.

Thanks for all the info.