Broke It - Help!

 
Paul Hutchings
Super Advisor

OK, a bit of a drastic subject line, and fortunately it's an eval unit, but today we managed to do something we didn't expect.

The setup is two 12-disk P4500 nodes in a single cluster, both connected to the same switch.

We'd been pulling various cables and drives to simulate failure modes, and with the FOM (Failover Manager) installed and running we were blown away by the whole "5 pings lost and it's back" thing when you pull a node completely.

However, we then pulled "Node 2" (we just pulled both NICs for speed) and the group IP was no longer pingable.

Upon investigation, it seems the manager on "Node 1" had somehow been stopped. So when we deliberately took Node 2 offline, all we had left working was Node 1 with no manager running, plus the FOM.

We just could not connect to Node 1 because (I think) you connect to a management group via the group IP, which was dead. So I was in a chicken-and-egg situation: (I think) I knew I needed to start the manager on Node 1, but I couldn't log on to the group IP to do so, because two of the three managers were down and there was no quorum.
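To spell out the quorum arithmetic as I understand it, here is a rough Python sketch. It assumes quorum is a simple majority of the configured managers, which is my reading of how this works rather than anything confirmed by HP. The function names are made up for illustration and are not part of any SAN/iQ tool.

```python
from typing import Dict


def quorum_needed(configured_managers: int) -> int:
    """Smallest number of running managers that forms a strict majority.

    Assumption: quorum is a simple majority of configured managers.
    """
    return configured_managers // 2 + 1


def has_quorum(managers: Dict[str, bool]) -> bool:
    """managers maps a manager name to True if that manager is running."""
    running = sum(1 for is_running in managers.values() if is_running)
    return running >= quorum_needed(len(managers))


# The state we ended up in: Node 1's manager stopped, Node 2 unplugged,
# only the FOM still running.
broken = {"Node 1": False, "Node 2": False, "FOM": True}

# After plugging Node 2's NICs back in.
recovered = {"Node 1": False, "Node 2": True, "FOM": True}

print(quorum_needed(len(broken)))  # 2
print(has_quorum(broken))          # False
print(has_quorum(recovered))       # True
```

With three managers, two must be running. One stopped manager that nobody has noticed is enough to turn a single node failure into a loss of quorum.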

The solution was to reconnect the NICs on the second node so that quorum came back, and then to start the manager on the first node. In a full-on disaster, though, we may not have that luxury.

What have I missed?