Operating System - OpenVMS

Gesmundo
Advisor

Availability Manager Analyzer on Windows 7 could not fix quorum

We have a cluster of Itanium blades across two sites (four nodes at the primary site and three at the secondary) running OpenVMS V8.3-1H1. Availability Manager V3.0-2A runs on all nodes, and the Availability Manager Server also runs on one of the secondary-site nodes. We run the Availability Manager Analyzer V3.1-2 on a Windows 7 laptop and point it at the IP address of the node running the Availability Manager Server to monitor the cluster.

 

Using the Availability Manager Analyzer, we can monitor the VMS nodes and crash any of them. We simulated a primary-site failure by crashing the primary-site nodes, which leaves the secondary-site nodes hung. In the Analyzer we could see each node's icon grey out as it crashed; when all four primary nodes were down, the remaining icons turned grey as well. However, we cannot fix the quorum from any of the surviving secondary-site nodes because the FIX option is greyed out.

 

We are using the triplet *\1DECAMDS\c on our VMS nodes. In the end we were forced to use the iLO console to fix the quorum.
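For anyone unfamiliar with the triplets: on the VMS nodes they live in the data collector's access file (AMDS$DRIVER_ACCESS.DAT in our installation), one network_address\password\access entry per line, for example:

        *\1DECAMDS\c

The first field matches the client's network address (* is a wildcard), the second is the password, and, as I understand the Availability Manager documentation, the access code in the third field determines whether a client may perform write operations such as fixes; check the User's Guide for your version for the valid codes.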

 

Any suggestions for solving this are welcome.

 

Thanks.

Noel

3 REPLIES
Richard Brodie_1
Honored Contributor

Re: Availability Manager Analyzer on Windows 7 could not fix quorum

You should not run the Availability Manager Server on a managed node, particularly a clustered one. It is a normal user mode application, and will hang if the cluster loses quorum.

Andy Bustamante
Honored Contributor

Re: Availability Manager Analyzer on Windows 7 could not fix quorum

>>> It is a normal user mode application, and will hang if the cluster loses quorum.

 

As Richard points out, if you run the Availability Manager Server on a node in the cluster, then when the cluster isn't available the Availability Manager Server won't be available either. Your options are to bring up either an emulated VMS system outside the cluster with the Availability Manager Server installed, or a Windows-based system; I'd consider doing that at each site. I used to address this situation by running AM on a local Windows server at each site, with VPN access configured. That was before the AM relay server option was released.

If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net
John Gillings
Honored Contributor

Re: Availability Manager Analyzer on Windows 7 could not fix quorum

Slightly off topic...

 

>We simulated a primary site crash by crashing the primary site nodes leaving the secondary site nodes in hang state

 

Please be sure you understand exactly what you're simulating. Using AM to crash nodes serially does NOT simulate losing a site, because you can't synchronise the crashes; what you actually get is a series of crashes in quick succession. Look at the logs to see the true sequence of events.

 

This may be adequate for your purposes, but if you need to test a more realistic site failure, you have to be a bit more inventive. Here are some DANGEROUS programs that can be used to simulate a site failure by synchronising the target nodes so they crash as close to simultaneously as possible.

 

        .TITLE  gate
;
;  Implements a "starter gate" using the distributed lock manager:
;  take an exclusive (EX) mode lock on a resource, then hibernate,
;  holding the gate closed until this process is deleted.
;
        $LCKDEF                         ; lock-manager symbols

        .PSECT  data,rd,wrt,noexe,quad
lksb:   .BLKL   2                       ; lock status block
res:    .ASCID  /StarterGate/           ; resource name

        .PSECT  code,rd,nowrt,exe
        .ENTRY  start,^M<>
        $ENQW_S efn=#0, -
                lkmode=#LCK$K_EXMODE, - ; EX mode blocks the waiters
                lksb=lksb, -
                flags=#LCK$M_NODLCKWT, -
                resnam=res
        $HIBER_S                        ; sleep, holding the lock
        RET
        .END    start

 

 

        .TITLE  SiteFailover
;
;       Deliberate crash of a system, synchronised across multiple nodes
;       using the "starter gate" lock.  A CR-mode request is incompatible
;       with the gate's EX-mode lock, so it stalls until the gate drops.
;
        $LCKDEF

        .PSECT  data,rd,wrt,noexe,quad
lksb:   .BLKL   2                       ; lock status block
res:    .ASCID  /StarterGate/           ; same resource name as the gate

        .PSECT  code,rd,nowrt,exe
        .ENTRY  start,^M<>
        $ENQW_S efn=#0, -
                lkmode=#LCK$K_CRMODE, - ; waits while the gate holds EX
                lksb=lksb, -
                flags=#LCK$M_NODLCKWT, -
                resnam=res
;       $CMKRNL_S routin=die            ; commented out for safety
        MOVL    #40,R0
        RET

        .ENTRY  die,^M<>                ; executes in kernel mode
        CLRL    R0
;       MOVL    (R0),R0                 ; commented out for safety: reading
                                        ; address 0 in kernel mode bugchecks
        RET
        .END    start

 

Start by running the "starter gate" program as a subprocess: it takes out an exclusive lock on the StarterGate resource and then hibernates. Now run the SiteFailover program on each node you want to crash. (Realise that you'll need to remove the safety comments for it to work, and the processes will need CMKRNL privilege.) They will all queue behind the lock held by the starter gate. When you're ready, kill the starter-gate process. Dropping its lock grants all the waiting killer processes at once, crashing all the systems before they have time to detect each other's departure. This gives a much more realistic site-failure crash than crashing the nodes serially.
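For concreteness, a possible build-and-run sequence in DCL (a sketch only; the file and process names here are illustrative, and nothing will actually crash until the safety comments above are removed):

        $ MACRO GATE                    ! assemble the starter gate
        $ LINK GATE
        $ SPAWN/NOWAIT/PROCESS=STARTER_GATE RUN GATE
        $!
        $! On each node to be crashed, from an account in the same UIC
        $! group (group-wide lock resources are shared cluster-wide)
        $! holding CMKRNL:
        $ MACRO SITEFAILOVER
        $ LINK SITEFAILOVER
        $ RUN SITEFAILOVER              ! stalls on the StarterGate lock
        $!
        $! When ready, open the gate: deleting the subprocess drops the
        $! EX lock and all the waiting CR requests are granted at once.
        $ STOP STARTER_GATE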

 

I call our killer program "DELIBERATE_SYSTEM_CRASH" and build it only when needed (indeed, the image is deleted before releasing the StarterGate). This helps prevent accidents and means the crash dumps clearly indicate that the crash was intentional.

 

A crucible of informative mistakes