StoreVirtual Storage

Sean Petty
Occasional Contributor

P4300 Crashed w/ loss of 1 node

We have a P4300 SAN with 2 nodes, both at the same location in the same rack.  The two nodes are on the same subnet, 172.20.254.0/24, with one at .11 and the other at .12, and the VIP is .10.  We have a FOM configured as .13.  The ESX cluster is 4 hosts, and there are five 500 GB datastores on the overall SAN.

 

Today we upgraded from SAN/iQ 8.5 to 9.0, and it was a disaster.  Both nodes were going offline, the VIP wasn't responding to pings, and the upgrade failed about a dozen times at various intermediate version levels.  When we finally finished and were at 9.0, we wanted to see what would happen if we lost a single node, so we disconnected the iSCSI network from one node.  The VIP immediately stopped pinging, as did that node (obviously), and then a short time later the second node stopped pinging and everything froze. 

 

When everything comes back online, we can pull the cables on the second node instead, and the same thing happens.  The VIP stops responding, and after 20 or 30 seconds the other node stops responding as well. 

 

It would seem that something must be misconfigured, because we should be able to lose a single node and still keep running.  When we look at the management group there are 3 managers running: 2 regular managers (one per node, with node1 as the coordinating manager) and 1 failover manager.  The FOM is a VM within the ESX cluster. 
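
For what it's worth, here is how I understand the quorum arithmetic for this layout (just my own sketch in Python; the manager names are placeholders, not anything from the CMC): with 2 node managers plus the FOM, quorum is a majority of 3, so losing any single manager should still leave 2 of 3 and keep things online.

```
# Rough sketch of manager quorum for a 2-node cluster plus a FOM.
# The manager names here are placeholders, not pulled from the CMC.
managers = {"node1": True, "node2": True, "fom": True}

def has_quorum(managers):
    """Quorum needs a strict majority of the configured managers."""
    return sum(managers.values()) > len(managers) // 2   # 2 of 3 is enough

managers["node1"] = False        # simulate losing one storage node
print(has_quorum(managers))      # True: 2 of 3 managers still running

managers["fom"] = False          # now lose the FOM as well
print(has_quorum(managers))      # False: 1 of 3, volumes should go offline
```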

 

Can anyone suggest where we may have gone wrong? 

Emilo
Trusted Contributor

Re: P4300 Crashed w/ loss of 1 node

So was this running for some time before you did the upgrade?

If so, did you test failover prior to the upgrade?

During the upgrade, did you experience similar outages?

Did you get messages regarding loss of quorum?

When you mention that you have the FOM installed on the ESX cluster, is it on the SAN storage?

If so, that will not work; you cannot have the FOM installed on the SAN storage. It needs to be on separate media.

 

Sean Petty
Occasional Contributor

Re: P4300 Crashed w/ loss of 1 node


Emilo,

 

Yes, it was running for some time - probably 2 years before we did the upgrade.  We had never tested failover prior to the upgrade.  We experienced a number of outages during the upgrade - times when the VIP wouldn't respond but both nodes were still pinging.  We did get loss-of-quorum messages.  The FOM is on the SAN storage, so we'll have to move that off.  What are the best practices for the FOM?  A separate physical machine, installing it on the physical drives of one of the ESX hosts, or even on a separate SAN?  Can we run more than one FOM so that it can live on the local datastores of multiple ESX hosts?

 

The other question in all this is the VIP.  It's not clear to me why the VIP doesn't respond when one of the two nodes is still up.  Do we need multicast ARP entries to point to the various NICs in the SAN nodes?

Sean Petty
Occasional Contributor

Re: P4300 Crashed w/ loss of 1 node


Emilo,

 

We moved the FOM to one of the local datastores on an ESX host and tested again.  We disconnected the iSCSI cables from one node, and within about 30 seconds some desktops lost all function, but others didn't.  I had a constant ping running to the two nodes, the FOM, and the VIP.  The FOM never lost a ping, but I started timing out to the node we disconnected first, then the VIP, then the second node.  By the end, neither node nor the VIP was returning pings, and some of the VMs lost functionality, but some (like the CMC) kept working without a problem.
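
In case anyone wants to reproduce the test, this is roughly what my ping watch amounted to (a rough Python sketch of what I was doing by hand; the addresses are just our values from above, and the ping flags assume a Linux box):

```
# Rough Python equivalent of the manual test: keep pinging the two nodes,
# the FOM, and the VIP, and note the order in which they stop answering.
import subprocess
import time

TARGETS = {
    "node1": "172.20.254.11",
    "node2": "172.20.254.12",
    "vip":   "172.20.254.10",
    "fom":   "172.20.254.13",
}

def alive(ip):
    """Send one ICMP echo with a 1-second timeout (Linux ping syntax)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

while True:
    status = {name: alive(ip) for name, ip in TARGETS.items()}
    line = " ".join(f"{name}:{'up' if ok else 'DOWN'}" for name, ok in status.items())
    print(time.strftime("%H:%M:%S"), line)
    time.sleep(2)
```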

 

If everything is properly configured, we shouldn't ever lose pings to the VIP, right?

Emilo
Trusted Contributor

Re: P4300 Crashed w/ loss of 1 node

Yes, it should stay up.

What is the CMC showing during these outages?

Is everything yellow, or does it have red X's?

Do you have separate switches, and if so, are they trunked?

Do you have the NICs in a team?

On your ESX hosts, is the iSCSI connection configured to point at the VIP (the VIP holder)?

 

Uwe Zessin
Honored Contributor

Re: P4300 Crashed w/ loss of 1 node

> We moved the FOM to one of the local datastores on an ESX host

 

The FOM must NOT be stored on a volume that is provided by a cluster for which the FOM provides quorum. It must live on independent storage.

NickPham
Occasional Visitor

Re: P4300 Crashed w/ loss of 1 node

Hi Sean,

 

I have a similar issue. Did you manage to resolve your issue?

 

Thanks.

Emilo
Trusted Contributor

Re: P4300 Crashed w/ loss of 1 node

Just put the FOM on ESX local storage.

The VIP should move over to the other node.

However, it sounds like you have some network issues.

 

David_Tocker
Regular Advisor

Re: P4300 Crashed w/ loss of 1 node

Yeah, it sounds like you are losing quorum, perhaps due to network issues? A slow ARP update?

Look into the number of managers you have: the FOM as well as the manager(s) running on the P4000 nodes.

Look in the documentation for some examples of the numbers you need.

 

As Emilo pointed out, the FOM -MUST- stay up, even if both nodes fail.

 

As far as switching goes, you will be safer with ALB. Concurrent write performance goes down a little, but in my opinion that is pretty insignificant in most cases - even when running LACP, my understanding is that you can only write at 1Gb per flow, as it will not 'double' the link.

Running LACP would in theory allow two servers to write to the single node simultaneously (or close to it).

I would be really interested to hear from any real experts on this; who doesn't want to try and maximize the performance of their SAN?
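
To make that concrete, here is a toy model of the hash-based link selection that LACP/Etherchannel does, as I understand it (purely illustrative; real switches hash on MAC/IP/port fields and the hash below is just a stand-in):

```
# Toy model of hash-based link selection in a 2 x 1 Gb LACP bundle: each flow
# is pinned to one member link, so a single flow tops out at 1 Gb, while two
# different flows *can* land on different links and use both. The hash below
# is a stand-in for whatever the switch actually hashes on (MAC/IP/port).
LINKS = 2           # member links in the bundle
LINK_SPEED_GB = 1   # per-link speed

def link_for_flow(src, dst):
    return hash((src, dst)) % LINKS

flows = [("esx1", "node1"), ("esx2", "node1")]
per_link = {}
for flow in flows:
    per_link.setdefault(link_for_flow(*flow), []).append(flow)

for link, assigned in sorted(per_link.items()):
    print(f"link {link}: {assigned} -> {LINK_SPEED_GB} Gb shared by {len(assigned)} flow(s)")
```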

 

If you have only two nodes and two switches which are not stacked, you are probably better off plugging one node into each switch to help avoid issues with ARP transitions.

 

I could go on all day about all the 'gotchas' that you can potentially have with these units - or any other similar 'clustered' iSCSI unit, for that matter. (Personally, I think a second dual-port NIC in each unit running a ring bus would do wonders for performance and stability on these nodes, a la Cisco StackWise.)

 

I have been putting some thought into what happens when a switch 'dies' with the link still up...

 

Could you please let us know what your switching setup is?

 

Important things to know:

 

a) Switch model(s)

b) Stacked?

c) ALB, LACP, HSB from the nodes?

d) Spanning-tree? PVST, MSTP, PVSTP etc?

e) Uplink to cluster nodes? LACP trunk, Etherchannel etc?

f) If switches are not stacked, are they connected to each other directly? (could cause weird STP issues)

 

 

Regards.

David Tocker
NickPham
Occasional Visitor

Re: P4300 Crashed w/ loss of 1 node

I believe moving the FOM to local ESXi storage will resolve my issues. I'm losing quorum when I shut down SAN node B. I can't confirm whether it also occurs when shutting down SAN node A. The management console states that node A is the coordinating manager, 3 managers are configured (2 regular and 1 failover manager), the special manager is the Failover Manager, and quorum is 2.
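
If I have the failure mode right, it is the dependency Uwe warned about above: with the FOM sitting on a SAN-backed datastore, shutting down node B takes out node B's manager and, in effect, the FOM's vote with it, leaving 1 of 3 managers, which is below the quorum of 2. A back-of-the-envelope sketch of that reasoning (the "FOM stalls with the volume" rule is my own simplification, not anything reported by the CMC):

```
# Back-of-the-envelope model of my setup. QUORUM and the rule that the FOM
# loses its vote whenever a node is down are my own simplifications.
QUORUM = 2  # majority of the 3 configured managers

def surviving_managers(node_a_up, node_b_up, fom_on_san):
    votes = int(node_a_up) + int(node_b_up)
    # Simplification: a FOM on a SAN-backed datastore is assumed to stall as
    # soon as either node is down; on local ESXi storage it keeps voting.
    fom_up = (node_a_up and node_b_up) if fom_on_san else True
    return votes + int(fom_up)

# Today: FOM on SAN storage, node B shut down -> 1 of 3 votes, quorum lost.
print(surviving_managers(True, False, fom_on_san=True) >= QUORUM)   # False

# After moving the FOM to local ESXi storage -> 2 of 3 votes, quorum holds.
print(surviving_managers(True, False, fom_on_san=False) >= QUORUM)  # True
```

Moving the FOM to local ESXi storage breaks that dependency, which is why I expect it to fix the problem.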

 

A 3rd party is managing the network infrastructure.

 

Here is what is currently configured.

 

2x SAN nodes (A, B): A at the primary site, B at the DR site

2x ESXi hosts (1, 2): 1 at the primary site, 2 at the DR site

 

a) Switch model(s) - D-link DSG3100 at both sites

b) Stacked - Yes

c) ALB, LACP, HSB from the nodes - ALB

d) Spanning-tree? PVST, MSTP, PVSTP etc? Not sure. I don't think so.

e) Uplink to cluster nodes? LACP trunk, Etherchannel etc? None. Only trunked for ESXi hosts.

f) If switches are not stacked, are they connected to each other directly? (could cause weird STP issues) N/A

 

Thanks again for your feedback guys.

 

Regards,

 

Nick