
Re: Two Queuemanagers without shared db on cluster.

 
SDIH1
Frequent Advisor

Re: Two Queuemanagers without shared db on cluster.

We don't want that, I was just curious.
Volker Halle
Honored Contributor
Solution

Re: Two Queuemanagers without shared db on cluster.


Would there be a problem if you made another QMAN$MASTER.DAT in another directory on the same disk?


No problem: the parent resource name includes the file-id of the QMAN$MASTER.DAT file, so each master file gets its own lock tree. Not that this would make sense...
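(As a quick check, DIRECTORY/FILE_ID displays the file-id that ends up in the resource name; a minimal sketch, assuming the default master-file location:)

$ DIRECTORY/FILE_ID SYS$COMMON:[SYSEXE]QMAN$MASTER.DAT   ! shows the (file-id) next to the file name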

Volker.
SDIH1
Frequent Advisor

Re: Two Queuemanagers without shared db on cluster.

OK, I did some testing. I started an independent queue manager on the quorum node, like this (node name obfuscated):

start/que/manager/new/on=(qnode)

show queue/manager/full shows this:

  Master file: SYS$SYSROOT:[SYSEXE]QMAN$MASTER.DAT;

  Queue manager SYS$QUEUE_MANAGER, running, on QNODE::
    /ON=(QNODE)
    Database location: SYS$COMMON:[SYSEXE]
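(For reference, a sketch of how such an independent queue manager can be set up. QMAN$MASTER is the documented logical for relocating the master file; it is only needed if the default SYS$COMMON:[SYSEXE] location won't do, and in this test the quorum node's own system disk made it unnecessary. The directory shown is hypothetical:)

$ DEFINE/SYSTEM/EXECUTIVE_MODE QMAN$MASTER SYS$SPECIFIC:[SYSEXE]   ! optional, non-default master-file location
$ START/QUEUE/MANAGER/NEW_VERSION/ON=(QNODE::)
$ SHOW QUEUE/MANAGER/FULL   ! confirm master file, node list and database location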

On the production nodes (there are four: A, B, C and D), the queue manager was running on node C. A reboot of node C made the 'production' queue manager shift to node A.

A STOP/QUEUE/MANAGER/CLUSTER command on QNODE stopped the queue manager on QNODE, but NOT on the production nodes, as hoped.
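(The scope check itself, as a sketch:)

$ STOP/QUEUE/MANAGER/CLUSTER    ! issued on QNODE: stops only the queue manager using QNODE's database
$ SHOW QUEUE/MANAGER/FULL       ! issued afterwards on a production node: should still show "running"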

A reboot of the quorum node did not affect the queue manager on the production nodes, and in the places I looked (the OPERATOR.LOG files) there was no evidence of any queue manager panicking over what to do.

I could have repeated this test 100 times, and rebooted all cluster nodes 100 times, but there really was no indication this would change the results, so I didn't.

Conclusions:

1. An independent queue manager works without problems, provided the START/QUEUE/MANAGER command has a carefully crafted node list in its /ON qualifier.

2. Although this is expected behaviour given the qualifiers available for starting the queue manager, it is rather challenging to extract this from the documentation. Many people weren't up to this challenge.

Any suggestions for more tests or things to try?
Jan van den Ende
Honored Contributor

Re: Two Queuemanagers without shared db on cluster.

Jose,

STOP/QUEUE/MANAGER is a rather controlled way of terminating a queue manager.
As far as I understood John Gillings' description, the real potential for trouble is when another node notices that a remote queue manager is gone (because the queue manager crashed, the node crashed, or connectivity disappeared).
You did not report on any such "catastrophe" scenario.

I am still very much in doubt about the wisdom of this configuration.

Cheers.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Volker Halle
Honored Contributor

Re: Two Queuemanagers without shared db on cluster.

Jose,

Try a STOP/ID of the QUEUE_MANAGER process; it will just be restarted.
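(A minimal sketch of that test; the PID is of course hypothetical:)

$ SHOW SYSTEM            ! locate the QUEUE_MANAGER process and note its PID
$ STOP/ID=2040011A       ! hypothetical PID; JOB_CONTROL should simply restart the process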

The major issue is the correct specification of the QMAN$MASTER file location and its contents, i.e. the node(s) to run on and the physical location of the QMAN database files.

I see one lock (QMAN$ORB_LOCK) which is not a child of the Master File Access Lock and therefore is NOT unique for each QMAN$MASTER.DAT file in the cluster. This could be a potential problem, but it is only used if you set ACLs on the queues.
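(For anyone who wants to inspect these locks themselves, a sketch using SDA; the QMAN$ resource names are as Volker describes, not verified here:)

$ ANALYZE/SYSTEM
SDA> SHOW RESOURCE    ! scan the output for QMAN$... resource names and their parent resources
SDA> EXIT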

Volker.
SDIH1
Frequent Advisor

Re: Two Queuemanagers without shared db on cluster.

OK. I submitted 3 jobs (each doing a 20-minute wait) to the batch queue on the quorum node. Then I killed the queue manager process on the quorum node; the process came back immediately, all jobs still running.
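(The wait jobs can be as simple as this; WAIT20.COM and the queue name are made up:)

$! WAIT20.COM - keeps a batch job busy for 20 minutes
$ WAIT 00:20:00
$ EXIT

$ SUBMIT/QUEUE=QNODE_BATCH WAIT20.COM   ! submitted three times, one entry per job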

I killed it a second time and the queue manager process came back, then immediately a third time, after which it did not come back, OPERATOR.LOG stating INTERNALERROR, internal error caused loss of process status. A START/QUEUE/MANAGER got it up again; the jobs were still running.

Killing the queue manager on the production nodes showed comparable behaviour (I also STOP/ID'd the process 3 times); here too, the queue manager did not come back after the third STOP/ID. Furthermore, it favoured the node it had been running on before; I didn't see a switch to another node.

I could reproduce this behaviour of the queue manager process (usually coming back by itself, sometimes refusing to) on a standalone machine, so there seems to be no link between this behaviour and the two-queue-manager scenario.

So, even after a heavy beating, the queue managers do not seem to be affected by each other. I don't feel a crash on a node would introduce much more stress than a STOP/ID of the queue manager process does.

Ian Miller.
Honored Contributor

Re: Two Queuemanagers without shared db on cluster.

You should try crashing a node or two as well, and unplugging cluster connections. The queue manager is restarted when you stop it (by JOB_CONTROL?), which would not happen if the node crashed.
____________________
Purely Personal Opinion
Volker Halle
Honored Contributor

Re: Two Queuemanagers without shared db on cluster.

If the node crashes, there is no JOB_CONTROL to restart the QUEUE_MANAGER. There will be no failover attempt, as - in this configuration - the QUEUE_MANAGER is only allowed to run on the local (quorum) node. Once OpenVMS boots, JOB_CONTROL will start the QUEUE_MANAGER on the local node...

I really can't see a problem with this configuration - although it may be 'unsupported' by HP.

Volker.
SDIH1
Frequent Advisor

Re: Two Queuemanagers without shared db on cluster.

OK. I crashed the quorum node, and after the boot the queue manager was started (as expected, since it did so after a normal reboot as well).
No signs of the queue manager wanting to start on other nodes.

I also crashed node A, which was running the queue manager for the production machines. The queue manager failed over to node B. No signs of the queue manager trying to start on PDCC0E.

No jobs missing, apart from the ones executing at the time of the crash without restart and/or retain options.
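(For reference, those are the standard job qualifiers; queue and file names here are hypothetical:)

$ SUBMIT/RESTART/QUEUE=PROD_BATCH JOB.COM   ! /RESTART allows the job to be requeued and rerun after a crash
$ SET QUEUE/RETAIN=ERROR PROD_BATCH         ! retain failed entries in the queue for inspection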

It all looks pretty robust. The technical merits are clear; whether it is wise to implement such a configuration depends not only on those merits but also on other, more arbitrary considerations, like personal preference, operational skills in an organisation and, if you're really unlucky, existing policies and prejudice.

I'm waiting for HP to answer the question of whether this configuration is officially considered supported, which in my case is one of those arbitrary considerations.

Thomas Ritter
Respected Contributor

Re: Two Queuemanagers without shared db on cluster.

We run a 4-node disaster-tolerant cluster: 2 nodes at each site, working over two fibre-optic links about 4 and 8 km in length. We run a single queue manager. Each node offers identical services. A major component of the system administrators' duties is to ensure that no one node is overburdened with work, the idea being that the cluster will only run as fast as the slowest node! We run Oracle/RDB with global buffering enabled, meaning millions of locks generated; lock-tree bouncing and CPU saturation are very real risks in our environment.

We use the queue manager to ensure that the workload is equitably balanced and that like work, with respect to database access, is performed on the same node(s). We achieve our workload spread by having carefully crafted /AUTOSTART_ON=() lists. Almost every queue is autostart-enabled. When a host is shut down, autostart is disabled and the queues fail over gracefully. At VMS startup, all the queues automatically balance back to the first entry in the autostart list.

If we ran separate queue managers, we would lose a lot of the flexibility we currently rely on.
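(A minimal sketch of the autostart pattern described above; node and queue names are hypothetical:)

$ INITIALIZE/QUEUE/BATCH/AUTOSTART_ON=(NODEA::,NODEB::) SITE1_BATCH
$ START/QUEUE SITE1_BATCH        ! queue becomes active on the first available node in the list
$ ENABLE AUTOSTART/QUEUES        ! issued on each node that may host autostart queues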