
Re: Two Queuemanagers without shared db on cluster.

 
SDIH1
Frequent Advisor

Re: Two Queuemanagers without shared db on cluster.

Ok. I submitted 3 jobs (each doing a wait for 20 minutes) to the batch queue on the quorum node. Then I killed the queue manager process on the quorum node; the process came back immediately, all jobs still running.

I killed it once more and the queue manager process came back, then immediately again, but this time it did not come back, logging INTERNALERROR in OPERATOR.LOG: internal error caused loss of process status. A START/QUEUE/MANAGER got it up again, and the jobs were still running.

Killing the queue manager on the production nodes showed comparable behaviour (there too I stopped the process with STOP/ID three times); again, the queue manager did not come back after the third STOP/ID. Furthermore, it favoured the node it had been running on before; I didn't see a switch to another node.

I could reproduce this behaviour of the queue manager process, i.e. usually coming back by itself but sometimes refusing to, on a standalone machine, so there seems to be no link between this behaviour and the two-queue-manager scenario.

So, even after a heavy beating, the queue managers do not seem to be affected by each other. I don't feel a crash on a node would introduce much more stress than a STOP/ID of the queue manager process does.
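
For reference, roughly what the test looked like (queue and file names here are illustrative, not the real ones):

   $ ! Submit three jobs that just wait for 20 minutes
   $ SUBMIT/QUEUE=QUORUM_BATCH TESTWAIT.COM  ! TESTWAIT.COM does: $ WAIT 00:20:00
   $ SUBMIT/QUEUE=QUORUM_BATCH TESTWAIT.COM
   $ SUBMIT/QUEUE=QUORUM_BATCH TESTWAIT.COM
   $ SHOW SYSTEM                             ! locate QUEUE_MANAGER, note its PID
   $ STOP/ID=pid                             ! JOB_CONTROL normally restarts it
   $ ! After the third kill it stayed down (INTERNALERROR in OPERATOR.LOG):
   $ START/QUEUE/MANAGER                     ! brings it back by hand
   $ SHOW QUEUE/ALL/BATCH                    ! verify the jobs are still running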

Ian Miller.
Honored Contributor

Re: Two Queuemanagers without shared db on cluster.

You should try crashing a node or two as well, and unplugging cluster connections. The queue manager is being restarted when you stop it (by JOB_CONTROL?), which would not happen if the node crashed.
____________________
Purely Personal Opinion
Volker Halle
Honored Contributor

Re: Two Queuemanagers without shared db on cluster.

If the node crashes, there is no JOB_CONTROL to restart the QUEUE_MANAGER. There will be no failover attempt, as - in this configuration - the QUEUE_MANAGER is only allowed to run on the local (quorum) node. Once OpenVMS boots, JOB_CONTROL will start the QUEUE_MANAGER on the local node...
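
As a minimal sketch of how such a node-confined queue manager could be set up (the node name QUORUM and the directory are hypothetical; QMAN$MASTER would normally be defined in SYLOGICALS.COM before any queue activity):

   $ ! On the quorum node only: point the queue database at a local disk
   $ DEFINE/SYSTEM/EXECUTIVE_MODE QMAN$MASTER QUORUM$DKA0:[QUEMAN]
   $ ! Start a queue manager restricted to this node; without a "*" in
   $ ! the /ON list there is no failover candidate, hence no failover
   $ START/QUEUE/MANAGER/NEW_VERSION/ON=(QUORUM) QUORUM$DKA0:[QUEMAN]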

I really can't see a problem with this configuration - although it may be 'unsupported' by HP.

Volker.
SDIH1
Frequent Advisor

Re: Two Queuemanagers without shared db on cluster.

Ok. I crashed the quorum node, and after the boot the queue manager was started (as expected, since it also did so after a normal reboot). No signs of the queue manager wanting to start on other nodes.

I also crashed node A, which was running the queue manager for the production machines. The queue manager failed over to node B. No signs of the queue manager trying to start on PDCC0E.

No jobs went missing, apart from the ones executing at the time of the crash without restart and/or retain options.
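
For completeness, the qualifiers that would have preserved those executing jobs (queue and file names are illustrative):

   $ ! /RESTART requeues a job that was executing when its node crashed;
   $ ! /RETAIN=ERROR keeps a completed job in the queue if it failed
   $ SUBMIT/QUEUE=PROD_BATCH/RESTART/RETAIN=ERROR NIGHTLY.COM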

It all looks pretty robust. The technical merits are clear; whether it is wise to implement such a configuration depends not only on technical merits but also on other, more arbitrary considerations, like personal preference, operational skills in an organisation and, if you're really unlucky, existing policies and prejudice.

I'm waiting for HP to answer the question of whether this configuration is officially considered supported, which in my case is one of those arbitrary considerations.

Thomas Ritter
Respected Contributor

Re: Two Queuemanagers without shared db on cluster.

We run a 4-node disaster-tolerant cluster: 2 nodes at each site, working over two fibre optic links about 4 and 8 km in length. We run a single queue manager. Each node offers identical services. A major component of the system administrators' duties is to ensure that no one node is overburdened with work, the idea being that the cluster will only run as fast as the slowest node! We run Oracle/RDB with global buffering enabled, meaning millions of locks are generated. Lock tree bouncing and CPU saturation are very real risks in our environment.

We use the queue manager to ensure that the workload is equitably balanced and that like work, with respect to database access, is performed on the same node(s). We achieve our workload spread with carefully crafted /AUTOSTART_ON=() lists. Almost every queue is autostart enabled. When a host is shut down, autostart is disabled and the queues fail over gracefully. At VMS startup all the queues automatically balance back to the first entry in the autostart list.
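
As a sketch of that mechanism (node and queue names invented, not our real configuration):

   $ ! Batch queue that prefers NODEA, failing over to NODEB, then NODEC
   $ INITIALIZE/QUEUE/BATCH/START -
         /AUTOSTART_ON=(NODEA::,NODEB::,NODEC::) -
         /JOB_LIMIT=4 APP_BATCH
   $ ! In each node's startup, allow autostart queues to run there
   $ ENABLE AUTOSTART/QUEUES
   $ ! At shutdown (SHUTDOWN.COM does this), queues fail over gracefully
   $ DISABLE AUTOSTART/QUEUES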

If we ran separate queue managers, we would lose a lot of the flexibility we currently rely on.
Jon Pinkley
Honored Contributor

Re: Two Queuemanagers without shared db on cluster.

Thomas Ritter>>>"We run a 4 node disaster tolerant cluster. 2 nodes at each site working over two fibre optic links"

What are you using for a quorum tie breaker?

While I understand your concerns about the lock manager in general, I don't think it will ever be an issue on a node used just as a tie breaker.

Jose (SDIH1) is configuring a quorum node, and I don't think he wants production jobs running there.

----------------------------

Jose,

Volker's response from Jul 19, 2007 13:13:47 GMT lists the limitations of running separate queue managers, specifically the invisibility of the queues on the other nodes. What type of batch jobs were you planning to run on the quorum node? You've stated that you don't plan to mount the drives from the production servers, so that's going to limit what you can do.

If you want to be able to print files from the production nodes, you can telnet to one of the other nodes and do the work there. And you can create print queues using TCPIP$TELNETSYM that autostart on more than a single node, but that would require a printer with raw TCP/IP capability (like a JetDirect print server) at the quorum site.
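
A sketch of such a queue, assuming a printer listening on the usual JetDirect raw port 9100 (host, port, node and queue names are made up; check the TCP/IP Services documentation for the exact TELNETSYM options):

   $ ! Print queue driven by the TCP/IP Services telnet symbiont,
   $ ! able to autostart on either production node
   $ INITIALIZE/QUEUE/START/PROCESSOR=TCPIP$TELNETSYM -
         /AUTOSTART_ON=(NODEA::"prtsrv:9100",NODEB::"prtsrv:9100") -
         QUORUM_PRINT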

I think what I would do is MSCP-serve the quorum node's system disk (so it would be possible to print files from it) and just run the queue manager on the production nodes, not even starting it on the quorum node. Then if I were at the quorum site, I would telnet to one of the production nodes and do my work on one of those servers.
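
For what it's worth, the SYSGEN parameters involved in MSCP-serving the quorum node's disks would look something like this (values are an assumed example, to be run through AUTOGEN):

   ! In MODPARAMS.DAT on the quorum node
   MSCP_LOAD = 1        ! load the MSCP server at boot
   MSCP_SERVE_ALL = 2   ! serve locally attached disks to the cluster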

In summary: I don't see any technical reason that what is being discussed here wouldn't work, within the limitations discussed. However, my preference would be to have a single queue manager database, just like the rest of the cluster-common files (e.g. SYSUAF, RIGHTSLIST, etc.).
it depends
SDIH1
Frequent Advisor

Re: Two Queuemanagers without shared db on cluster.

The separate queue manager is only meant to start a few batch jobs for housekeeping and management on the quorum node. We don't need printing or anything else. As these batch jobs are run by 'standard' management software, the alternative of changing these few batch jobs into detached processes would involve an overhaul of that 'standard' management software, and is considered too much work and risk for this one exception.

What's more, we don't use any of the cluster-wide databases on the quorum node, as only system managers would need to log in to this node, if ever.

As said before, the disadvantage of using MSCP is that you create dependencies between the quorum node and the rest of the cluster that could bite you if real trouble arrives, possibly preventing the quorum node from doing its main job: acting as a tie-breaker for the production nodes.
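
For those few housekeeping jobs, something this simple would do under the quorum node's own queue manager (node, queue and file names are illustrative):

   $ ! Runs entirely under the quorum node's local queue database;
   $ ! nothing here references disks or queues on the production side
   $ INITIALIZE/QUEUE/BATCH/START/ON=QUORUM:: SYS$BATCH
   $ SUBMIT/QUEUE=SYS$BATCH/RESTART HOUSEKEEPING.COM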
Bart Zorn_1
Trusted Contributor

Re: Two Queuemanagers without shared db on cluster.

Jose,

A quorum node does not need access to ANY disk to perform its function: being a quorum node. Not even its own system disk!

I do not think that using MSCP-served disks on the quorum node (for whatever purpose) will interfere with the proper operation of tie-breaking for the production nodes.

After all, the quorum node is just another cluster member. The same rules for cluster connectivity apply to all cluster members. Unless you have a quorum disk (which is unlikely if you have a quorum member), accessibility of disks does not influence cluster connectivity.
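
To illustrate, in a 2-production-plus-quorum configuration of this kind the MODPARAMS.DAT entries would look something like this (an assumed example; AUTOGEN applies them):

   ! One vote per node; quorum is 2 of the 3 expected votes, so the
   ! production pair survives the loss of any single node
   VOTES = 1
   EXPECTED_VOTES = 3
   DISK_QUORUM = " "    ! no quorum disk when a quorum node is used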

Regards,

Bart Zorn
SDIH1
Frequent Advisor

Re: Two Queuemanagers without shared db on cluster.

Although it's probably possible that a node losing its system disk might be alive enough to cast its vote in the cluster, it wouldn't be an overly reliable system, would it?
Thomas Ritter
Respected Contributor

Re: Two Queuemanagers without shared db on cluster.

Jon Pinkley>>>"What are you using for a quorum tie breaker?"

We use an HP Windows-based system called DTCS. One is located at each site. It uses the same quorum-adjust mechanism as AMDS. We connect via RDP from the WAN. Very useful.
Bart Zorn_1
Trusted Contributor

Re: Two Queuemanagers without shared db on cluster.

Hi Jose,

it depends on what you expect from a reliable system!

If a quorum system is up and running, and you do not expect it to do anything else, it is very reliable, with or without access to its system disk.

During cluster state transitions, there is nothing that OpenVMS needs to fetch from the system disk. Once the transitions are over, not having access to the system disk may be a problem for other things running on the quorum node.

Regards,

Bart
Ian Miller.
Honored Contributor

Re: Two Queuemanagers without shared db on cluster.

FYI, DTCS is sold as a service package, so you cannot just buy the Windows program to regain quorum. Availability Manager is also useful in these circumstances.
____________________
Purely Personal Opinion
SDIH1
Frequent Advisor

Re: Two Queuemanagers without shared db on cluster.

In case of real trouble, having the quorum system able to boot from its own system disk, and not being dependent on disks in the greater cluster, gives you more options in the process of recovering the cluster from failure. You can discard these options in favour of easier maintenance of the cluster as a whole, but then you definitely stray from the disaster-tolerant path.

Depending on your design goals, operational requirements and the time you're allotted to recover from failures and disasters, you may choose to tilt the balance toward easy management versus robust recovery possibilities. In my case, the cluster is supposed to be designed with disaster not as a sideline possibility, but as a certainty that has to be addressed with every means available.
Art Wiens
Respected Contributor

Re: Two Queuemanagers without shared db on cluster.

Thomas, do you have some links to this "HP Windows-based system called DTCS"? Had a quick Google and didn't see anything obvious.

I wouldn't have to have an old VAX or Alpha (i.e. a VMS system) as a quorum node? This could be quite useful to me!

Cheers,
Art
SDIH1
Frequent Advisor

Re: Two Queuemanagers without shared db on cluster.

Everything has been said. Up for new adventures!