VLAN question on two 2810G-24

maknotek
Occasional Advisor

I have a customer whose VMware ESX 4 servers are having trouble reaching the storage VLAN (VLAN100) across two ProCurve 2810 switches. Everything worked without any issues until we migrated his new hardware into his server rack. Since the move, vCenter has been sending my customer emails about losing connectivity to multiple datastores on the SAN. Everything stays up and running and users don't seem to notice, but we continue to get spammed with error messages.

I would like to verify the configuration and make sure that we've got everything configured as it should be.

Both switches have ports 1-12 untagged for VLAN100 and NOT configured as DEFAULT_VLAN ports. Ports 13-22 are untagged for the DEFAULT_VLAN and are not members of VLAN100. It's pretty basic, but jumbo frames are enabled for VLAN100 (and on the SAN and ESX servers).

My worry is that there is something wrong with our redundant trunk links between the switches. We have ports 23-24 configured as Trk1 on both switches. I set Trk1 to be untagged for DEFAULT_VLAN and tagged for VLAN100. When I test from my laptop and/or other devices, everything appears to be functional. However, we are getting messages at random about losing the path to our datastores. The messages indicate that all of the servers/datastores are losing connectivity at random times during the day.

Right now, I have everything on one of the 2810Gs and it hasn't generated an error for a couple of days. If I move any of my storage links to the second switch, then after a few hours the errors start appearing again.

Summary:

sw1
ports 1-12 DEFAULT_VLAN: No
ports 1-12 VLAN100: Untagged
ports 13-22 DEFAULT_VLAN: Untagged
ports 13-22 VLAN100: No
ports 23-24 configured as Trk1
Trk1 DEFAULT_VLAN: Untagged
Trk1 VLAN100: Tagged

I haven't configured anything else on the switches other than jumbo frames on VLAN100 for both switches. We had a slight power issue on the second switch which we replaced to eliminate that issue. We also had a bad network cable from one of the SAN ports and replaced that as well. I can't see anything else that could cause this issue other than my switch config so any help would be greatly appreciated.
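Since jumbo frames are enabled on VLAN100, one check worth considering is a don't-fragment ping from each ESX host to each SAN address, sized to fill a full jumbo frame. The arithmetic for the payload size is sketched below. The MTU of 9000 is an assumption (the actual value configured on the ESX servers and SAN is not stated here), and the function name is made up for illustration.

```python
# Header sizes for an ICMP echo carried over IPv4 without IP options.
IP_HEADER = 20    # bytes
ICMP_HEADER = 8   # bytes

def df_ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented packet of size mtu."""
    return mtu - IP_HEADER - ICMP_HEADER

# 9000 is an assumed jumbo MTU; substitute the value actually configured.
print(df_ping_payload(9000))  # 8972
print(df_ping_payload(1500))  # 1472
```

If a ping of that size with the don't-fragment bit set fails on a path that crosses the trunk but succeeds within one switch, that would point at a frame-size mismatch somewhere on the path.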
14 REPLIES
Mohammed Faiz
Honored Contributor

Re: VLAN question on two 2810G-24

Hi,

That setup seems reasonable. A few questions: are you routing between the two VLANs?
Have you checked the switch logs for any errors at the times you see the messages from VMware ("show log" or "show int X" for the relevant ports)?
maknotek
Occasional Advisor

Re: VLAN question on two 2810G-24

There isn't any IP routing between the VLANs yet. We will eventually be using a Layer3 switch but right now it's just these two switches.

I haven't checked the logs but I will give that a shot as well.

Should the DEFAULT_VLAN be tagged on the trunk links as well or is that only required for VLAN100?
Mohammed Faiz
Honored Contributor

Re: VLAN question on two 2810G-24

I prefer to tag VLANs on links between switches because it avoids any "VLAN leaking" mistakes but it will work fine if you have one of them untagged.

> There isn't any IP routing between the VLANs yet.

So we're just talking about 2 layer 2 vlans, vlan 100 and vlan 1?
Are the errors only related to clients talking to other clients on VLAN 100? It would probably be easier to diagnose if you could give a specific example with hosts and their associated addresses.
maknotek
Occasional Advisor

Re: VLAN question on two 2810G-24

Yep, just VLAN 1 and VLAN 100 on two layer-2 switches.

On both switches, I have 1 uplink from each ESX server for local traffic (172.16.10.1, 172.16.10.2, 172.16.10.3). They are bonded within VMware, so the idea is that either switch can carry the load if the other fails.

I also have 1 uplink to each switch from each server for the storage network (VLAN100), and I'm using a 192.168.254.0 network.

Here is a general summary of what I've got configured. This exact setup seemed to be working until recently.

ESX1
Local IP: 172.16.10.1
Storage IP: 192.168.254.11 (VLAN100)

ESX2
Local IP: 172.16.10.2
Storage IP: 192.168.254.12

ESX3
Local IP: 172.16.10.3
Storage IP: 192.168.254.13

NetApp FAS2020
Controller1 IP: 192.168.254.21
Controller2 IP: 192.168.254.22

The errors come from each of the ESX servers indicating that they're losing connection to the NetApp at random times during the day. The virtual servers stay operational but there must be a momentary disconnect from the storage. Users don't really notice anything happening but there is a brief loss of connection to the NetApp.

What led me to believe my switch configuration was off was that storage kept dropping on each of the servers, for volumes on either NetApp controller. If I leave everything on a single switch then it stays operational without any errors.

Let me know if there is any other information I can provide.

Mohammed Faiz
Honored Contributor

Re: VLAN question on two 2810G-24

Ok, again, that all looks very straightforward so you shouldn't be having an issue with that setup.
Do the disconnect errors all occur at the same time on the servers?
When you were using both switches did you have the NetApp connections split between the two switches or were they both on the same one?
As I mentioned before hopefully the switch logs can point you in the right direction.
As it's a fairly broad problem I'd also suggest taking a look at whether there's a software update available for your 2810s.
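To answer the "same time" question with something firmer than an impression, the alert times from the vCenter emails can be grouped by how close together they fall. Below is a minimal sketch. The event list is made-up sample data, and the 30-second window and the function name are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta

def group_coincident(events, window_seconds=30):
    """Group (host, timestamp) events that fall within window_seconds
    of the first event in the current group."""
    groups = []
    for host, ts in sorted(events, key=lambda e: e[1]):
        if groups and ts - groups[-1][0][1] <= timedelta(seconds=window_seconds):
            groups[-1].append((host, ts))
        else:
            groups.append([(host, ts)])
    return groups

# Hypothetical alert times typed in from vCenter emails (not real data).
events = [
    ("ESX1", datetime(2010, 1, 1, 10, 0, 5)),
    ("ESX2", datetime(2010, 1, 1, 10, 0, 7)),
    ("ESX3", datetime(2010, 1, 1, 10, 0, 9)),
    ("ESX2", datetime(2010, 1, 1, 14, 30, 0)),
]

for group in group_coincident(events):
    hosts = sorted({h for h, _ in group})
    print(group[0][1].isoformat(), hosts)
```

Groups that contain every host suggest a shared cause (a switch, the trunk, or the storage controller), while single-host groups point more toward that host's own links.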
maknotek
Occasional Advisor

Re: VLAN question on two 2810G-24

Yes, the NetApp has two links per controller and one is going to each switch.

The disconnects do seem to happen at the exact same time. I'll check the logs when I get up there and I'll also look for updates for the switches.

Thanks for all of the suggestions so far as well!

Re: VLAN question on two 2810G-24

Hi,
do you have spanning tree enabled on the ports connected to the ESX servers?

I'm asking because some months ago I found this article:

"ESX Server issues beacon packets from one adapter addressed to other adapters assigned to a virtual switch. By monitoring beacon reception, the server can detect distributed connection failures. These packets are, in some cases, interpreted by switches as BPDU packets. For some strange reason these beaconing packets seems to be behaving like BPDU packets from a switch point of view.


This can, in conjunction with Spanning Tree, cause major problems in a network triggering constant topology changes, hence triggers elections of rootbridges. The effect from this, except flooding your syslog server and SNMP services, render your network more or less useless."


maknotek
Occasional Advisor

Re: VLAN question on two 2810G-24

STP isn't enabled, although the running config shows that some form of STP setting is added automatically on trunk links.

Over the weekend we tried several things. We updated the switches from version 11.15 to 11.25. We also checked the logs thoroughly when the disconnects occurred but couldn't find anything useful. I left a continuous ping running from one switch to the other (VLAN100 to VLAN100) and noticed several high response times whenever the disconnects occurred. We also moved Trk1 to ports 21-22 and the same issue occurred. With a single cable between the two switches, the same issue happens as well.

Here is a copy of both switch configs as well:

Running configuration:

; J9021A Configuration Editor; Created on release #N.11.15

hostname "vsphere-sw1"
trunk 23-24 Trk1 Trunk
snmp-server community "public" Unrestricted
vlan 1
   name "DEFAULT_VLAN"
   untagged 13-22
   no ip address
   tagged Trk1
   no untagged 1-12
   exit
vlan 100
   name "SAN"
   untagged 1-12
   ip address 192.168.254.1 255.255.255.0
   tagged Trk1
   jumbo
   exit
spanning-tree Trk1 priority 4
password manager
password operator

Running configuration:

; J9021A Configuration Editor; Created on release #N.11.15

hostname "vsphere-sw2"
trunk 23-24 Trk1 Trunk
snmp-server community "public" Unrestricted
vlan 1
   name "DEFAULT_VLAN"
   untagged 13-22
   no ip address
   tagged Trk1
   no untagged 1-12
   exit
vlan 100
   name "SAN"
   untagged 1-12
   ip address 192.168.254.2 255.255.255.0
   tagged Trk1
   jumbo
   exit
spanning-tree Trk1 priority 4
password manager
password operator

We've already gone through testing all of the cables and replaced them and one of the switches. We're also going to try replacing the other switch to see if it makes a difference.
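For comparing the two configs by machine rather than by eye, here is a minimal sketch that extracts the statements inside each vlan block and reports anything present on only one switch. The parsing rules are an assumption based on the layout of the configs posted above (a "vlan N" line opens a block and "exit" closes it), and the function names are made up for illustration. The management IP address is ignored because it is expected to differ.

```python
def parse_vlans(config_text):
    """Map each VLAN ID to the set of statements inside its block."""
    vlans, current = {}, None
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("vlan "):
            current = line.split()[1]
            vlans[current] = set()
        elif line == "exit":
            current = None
        elif current is not None and line:
            vlans[current].add(line)
    return vlans

def diff_vlans(a, b, ignore_prefixes=("ip address",)):
    """Return, per VLAN, the statements that appear in only one config."""
    out = {}
    for vid in sorted(set(a) | set(b)):
        only = a.get(vid, set()) ^ b.get(vid, set())
        only = {s for s in only if not s.startswith(ignore_prefixes)}
        if only:
            out[vid] = only
    return out

# VLAN sections copied from the vsphere-sw1 config above.
SW1 = """
vlan 1
   name "DEFAULT_VLAN"
   untagged 13-22
   no ip address
   tagged Trk1
   no untagged 1-12
   exit
vlan 100
   name "SAN"
   untagged 1-12
   ip address 192.168.254.1 255.255.255.0
   tagged Trk1
   jumbo
   exit
"""
# vsphere-sw2 differs only in its VLAN 100 IP address.
SW2 = SW1.replace("192.168.254.1 ", "192.168.254.2 ")
# Hypothetical broken variant, to show what a mismatch looks like.
SW2_NO_JUMBO = SW2.replace("   jumbo\n", "")

print(diff_vlans(parse_vlans(SW1), parse_vlans(SW2)))           # {}
print(diff_vlans(parse_vlans(SW1), parse_vlans(SW2_NO_JUMBO)))  # {'100': {'jumbo'}}
```

An empty result for the two posted configs means the VLAN membership, tagging, and jumbo settings match on both switches, which is consistent with the problem lying elsewhere.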
Mohammed Faiz
Honored Contributor

Re: VLAN question on two 2810G-24

The automatic spanning tree additions just adjust the port cost, they don't actually enable any spanning tree protocols.
The high pings are interesting, especially if you're not dropping any pings or seeing any 'link down' messages.
It sounds like something is generating a large amount of broadcast traffic or similar.
Is there anything that is using multicast on your network?
It's probably worth enabling IGMP snooping on your VLANs anyway (vlan 100 ip igmp), as it's useful to have, and seeing if that makes a difference.
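To make the high ping times easier to line up against the VMware alerts, the ping output can be saved and scanned for slow replies. A minimal sketch follows. The sample lines are made up and use a Linux-style ping format, which is an assumption (the regular expression also accepts the "time = 1 ms" and "time<1ms" styles), and the 10 ms threshold is arbitrary.

```python
import re

# Matches "time=0.4 ms", "time = 1 ms" and "time<1ms".
RTT_RE = re.compile(r"time\s*[=<]\s*([\d.]+)\s*ms")

def find_spikes(lines, threshold_ms=10.0):
    """Return (line_index, rtt_ms) for every reply slower than threshold_ms."""
    spikes = []
    for i, line in enumerate(lines):
        m = RTT_RE.search(line)
        if m and float(m.group(1)) > threshold_ms:
            spikes.append((i, float(m.group(1))))
    return spikes

# Made-up sample output, not captured from the network in question.
sample = [
    "64 bytes from 192.168.254.2: icmp_seq=1 ttl=64 time=0.4 ms",
    "64 bytes from 192.168.254.2: icmp_seq=2 ttl=64 time=85.2 ms",
    "64 bytes from 192.168.254.2: icmp_seq=3 ttl=64 time=0.5 ms",
]
print(find_spikes(sample))  # [(1, 85.2)]
```

Recording the wall-clock time next to each line when the output is saved would let the spikes be matched directly against the timestamps in the vCenter emails.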
maknotek
Occasional Advisor

Re: VLAN question on two 2810G-24

There shouldn't be any multicast devices on the network but I will certainly turn IGMP snooping on when I'm onsite tomorrow.

Re: VLAN question on two 2810G-24

hi maknotek,
you wrote:

--------------------------------------
"On both switches, I have 1 uplink from each ESX server for local traffic (...). They are bonded within VMware so the idea is that either switch can carry the load if the other fails"

"I also have 1 uplink to each switch from each server for the storage network"
--------------------------------------

How is Load-balancing-policy configured on ESX vswitch (there are many options: failover/route based on ip hash / etc.) ?

Are you using the "distributed vswitch" feature ?


regards
maknotek
Occasional Advisor

Re: VLAN question on two 2810G-24


Marco,

__________________________________________
How is Load-balancing-policy configured on ESX vswitch (there are many options: failover/route based on ip hash / etc.) ?

Are you using the "distributed vswitch" feature ?
__________________________________________

We are not using a distributed vswitch. On each ESX server we have 2 uplinks for each storage vswitch. They have load balancing set to use the virtual switch port ID and are set to beacon probing for failure detection. The other options are unchecked on each storage vswitch.

Any recommendations?
maknotek
Occasional Advisor

Re: VLAN question on two 2810G-24

Mohammed,

I configured IGMP on VLAN100 but the issues are still occurring. I also replaced the other switch, so we essentially have two brand new 2810s.

I'm wondering if it isn't an issue w/ VMware or one of the NetApp links at this point. I have a few identical setups to this that are working without any issues.
Mohammed Faiz
Honored Contributor

Re: VLAN question on two 2810G-24

It does very much point to an issue on the server/appliance side at the moment.
The final test would be to have a network capture (not necessarily capturing full packets, just header information to start with) running on a mirrored port to determine what exactly is occurring on the network at the times when a failure is reported.