ProCurve 2848 - All users lose connection


I have a problem on my network: users' connections to the servers stop without warning. After about 10 seconds the connection returns and they can resynchronise (each PC thinks it is working offline after the disconnection, and then has to resync once the connection returns to get its network drives back). The clients are XP SP2 and the servers are W2k3 SP1. It seems like the switch is getting flooded and stops all connections, but looking at the log on the switch, the only errors are CRC errors, for which I have not yet found a resolution (the CRC errors happen randomly on a variety of different ports).

I am looking for ideas or any help at all on how to resolve this, as my users are getting increasingly annoyed with having to resync to the server.

Note: We have the desktop and my documents of all PCs redirected to the file server.

Thanks in advance
Kell van Daal
Re: ProCurve 2848 - All users lose connection

Hi John,

You say they lose their connection. Do they lose the physical connection?
I assume not, since you say you only have CRC errors in your log, and no port offline/online events?

If they lose the connection, is it also impossible to reach other servers etc.? For example, can they still ping a different server, or even the switch?

You also say it seems the switch gets flooded. How can you tell? Is it possible to use a sniffer to get a capture of what happens during a disconnection?

Also, do all users get disconnected at the same time, or only one (or a few) at a time?

Are the users and the servers on the same VLAN?

Does it happen at certain times, or completely at random?

Try setting up a continuous ping from one client to something other than the server. Also do the same from a server.
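
If you want timestamps with the ping results, so they can be lined up with the moments users report a disconnect, something like the following Python sketch could do it. It is only one possible approach under assumptions: it shells out to the system ping command, the function names are my own, and the hosts you pass in are yours to choose. Note that on Windows an exit code of 0 is not a perfect indicator (a "destination host unreachable" reply can also return 0), so treat the result as a hint.

```python
import platform
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=1):
    """Send one echo request with the system ping; True if it exited with 0."""
    if platform.system() == "Windows":
        # -n = number of requests, -w = timeout in milliseconds
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        # Linux-style flags: -c = number of requests, -W = timeout in seconds
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def monitor(hosts, rounds, interval_s=1.0, probe=ping_once):
    """Probe every host once per round; return a list of (timestamp, host, ok)."""
    samples = []
    for i in range(rounds):
        for host in hosts:
            samples.append((datetime.now(), host, probe(host)))
        if i < rounds - 1:
            time.sleep(interval_s)
    return samples


def find_outages(samples):
    """Collapse consecutive failed probes per host into (host, first, last)."""
    outages = []
    open_runs = {}  # host -> [first failed timestamp, last failed timestamp]
    for stamp, host, ok in samples:
        if not ok:
            if host in open_runs:
                open_runs[host][1] = stamp
            else:
                open_runs[host] = [stamp, stamp]
        elif host in open_runs:
            first, last = open_runs.pop(host)
            outages.append((host, first, last))
    # Hosts that were still failing when the run ended
    for host, (first, last) in open_runs.items():
        outages.append((host, first, last))
    return outages
```

Run `monitor([...], rounds=3600)` on a client with the file server, a second server and the switch in the list (names or addresses of your own), then pass the result to `find_outages`. If only the file server shows up there, the switch is probably not the culprit.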

I know, more questions than answers ;)
But these questions will provide a lot more insight into the problem.

One thing that popped up which could be the cause (without more info this is only a guess, based on your mention of flooding): do you have a "heavy" multicast source active when the disconnect occurs, like a Ghost multicast server? Combined with IGMP not being configured, this causes all ports to forward the multicast traffic when you are, for example, staging one client.

Re: ProCurve 2848 - All users lose connection

I have to say, all excellent questions.

1) They do not lose physical connection.

2) I have not fully tested this, but generally I think that when they lose connection the PC hangs, i.e. stops responding, and they lose connection to all servers. I will set up a PC to ping a variety of servers on our network and wait for it to happen again, so I'll have to get back to you with an accurate answer here.

3) It would be possible to set up a sniffer, but I would have to guess which port to put it on.

4) It seems that only a group of users gets disconnected at the same time.

5) All users and servers are on the same VLAN.

6) It is completely random.

We should not have a heavy multicast source on the network.
I am not sure if IGMP is configured. What is IGMP again?

Thanks again

Kell van Daal
Re: ProCurve 2848 - All users lose connection

Hi John,

IGMP stands for Internet Group Management Protocol.
In short it makes sure multicast data is only delivered on ports that have a client connected that "subscribed" for that data.
Without IGMP, multicast is handled as broadcast traffic, flooding all multicast data to all ports (in the same broadcast domain). Taking my Ghost server example again: when staging one PC with multicast, all the other PCs would also get that data (which is a lot, and fast, with Ghost). I have seen it happen a few times. It really brings the clients (usually not the switches) to a halt.
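
To make the difference concrete, here is a minimal Python sketch of the forwarding decision. It is a simplified model and not how the switch firmware actually works: the function name, the group address 239.1.1.1 and the port numbers are made up for illustration, and real IGMP handling also involves router ports, queries and membership timeouts.

```python
def egress_ports(all_ports, ingress_port, group, igmp_enabled, members):
    """Return the ports a multicast frame for `group` is sent out of.

    members maps a group address to the set of ports that joined it.
    """
    if igmp_enabled:
        # Only ports with a subscribed client receive the stream
        targets = set(members.get(group, set()))
    else:
        # No IGMP: the frame is treated like a broadcast
        targets = set(all_ports)
    # A frame is never sent back out of the port it arrived on
    return sorted(p for p in targets if p != ingress_port)


# Hypothetical setup: 48-port switch, Ghost server on port 1,
# one client being staged on port 7.
ports = range(1, 49)
joined = {"239.1.1.1": {7}}

flooded = egress_ports(ports, 1, "239.1.1.1", False, joined)
pruned = egress_ports(ports, 1, "239.1.1.1", True, joined)
print(len(flooded), "ports without IGMP,", len(pruned), "port with IGMP")
```

In this model every other port on the switch carries the Ghost stream when IGMP is off, while with IGMP on only the port of the client being staged does.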

Reading your answers, my preliminary conclusion would be that it isn't the switches that are causing the problems. Maybe name resolution isn't working correctly? Especially if you are still using NetBIOS, this can give problems like the ones you describe, so it may be good to look in that direction as well. Check WINS (if you use NetBIOS) and check DNS, not only on the server but also on the client (with nslookup and nbtstat, for example). Also check master browser status and such.
But a capture of the network during such a disconnect should be able to tell us that also.

Reactions to your answers:

3) I think using a sniffer at this point would be the fastest way to determine the cause, since it will give you the most information. About choosing ports: you can use port mirroring, so you don't have to choose just one port. Pick a few ports where it is likely to happen again. I don't know the interval between the disconnects, so if you need to run the sniffer for a whole day or so, be aware that it will probably create a huge log.

4) If it is only a group of users that get disconnected at the same time, do they have something in common? Are they all on the same switch (and is there no user w/o problems on that switch)? Or is there something else they have in common, like the same fileservers for their redirects (but different from those of users w/o problems)? Things like that.

5) Are all the servers and users also on the same switch, or do you have multiple switches? (Sorry, forgot to ask the first time around.)

Still no answers, but we are getting closer ;)
I think sniffing and interpreting the log would get us the info we need.

Good luck!

Re: ProCurve 2848 - All users lose connection

Thanks for the response Kell,

It has just happened again, and yes, I could ping other servers at the time of the disconnection, which indicates it is the server that is causing the problem, not the switch.

The server is a file server, a ProLiant DL380. I have the network cards teamed (each at 1 Gb), giving a 2 Gb connection.

IGMP is off

I have three HP ProCurve 2848s, each connected by one 1 Gb link from a main switch.

I don't think it is a name resolution problem, I don't use NetBIOS or WINS, and have checked DNS on PC and server.

Can you choose multiple ports to mirror, or just one?

The disconnects do not happen to a specific set of users, and I 'think' it is happening across multiple switches, which again points to the server having the problem.

I think what I'll do now is check the driver versions on the NICs in the server to see if they are up to date, and see about setting up a packet sniffer.
If I can only set up one port to sniff, would it be best on one of the server connections or on a client?

Getting there!

Kell van Daal
Re: ProCurve 2848 - All users lose connection

Hi John,

Good to see we are progressing.
About sniffing. The way I would do it is as follows:

Dedicate one machine to sniffing, not a workstation that is in use, because that will only make the log bigger.
Put that machine on the same switch as the server, on port 1 for example.
Configure that port to be the mirror port:
ProCurve2848(config)#mirror-port 1
Let's assume a client that has the problem is on port 2. I would configure port 2 to be mirrored:
ProCurve2848(config)#interface 2 monitor

This normally should give you enough information. If I am wrong, and the log does not reveal any possible causes, then you could also monitor the server.
You can monitor (mirror) multiple ports at the same time. So if traffic isn't high on clients, you can consider mirroring multiple clients at once.
Just don't forget to remove the mirror port after you're finished, if you are gonna use that port for normal operations.

About the teaming. What setup do you use? NFT, TLB, Fast path etc?

If this turns up empty-handed too, consider using Performance Monitor on the server (I assume you already checked the event log, but got nothing useful). Maybe the server has a bottleneck (which I doubt).
You could also check if all FSMO roles (domain roles) are being used (usually the first DC in the domain, but this isn't necessary). In what mode (functional level) are your domain and forest running?

Also, you said you don't use NetBIOS. Is NetBIOS disabled on all the workstations, or do you just not use WINS etc.? If NetBIOS isn't explicitly disabled, and you do a:
net use \\server\share
for example, the client will try to resolve that name through a NetBIOS broadcast.
You can check if clients do that with:
nbtstat -r
If any of the numbers there isn't zero, then you are still using NetBIOS.
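
If you want to run that check on a number of clients, the counters can be read by a small script. A minimal Python sketch, assuming the nbtstat -r output contains counter lines of the form "Name = number" (for example "Resolved By Broadcast = 4"); the function names are my own and the parsing is deliberately loose.

```python
import re
import subprocess

# Matches counter lines such as "Resolved By Broadcast     = 4"
COUNTER = re.compile(r"^\s*(.+?)\s*=\s*(\d+)\s*$")


def parse_nbtstat_counters(text):
    """Return {counter name: value} for every 'Name = number' line in text."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def uses_netbios(counters):
    """True if any counter is non-zero, i.e. NetBIOS is still being used."""
    return any(value != 0 for value in counters.values())


def read_nbtstat():
    """Run 'nbtstat -r' on a Windows client and return its text output."""
    result = subprocess.run(["nbtstat", "-r"], capture_output=True, text=True)
    return result.stdout
```

On a client, `uses_netbios(parse_nbtstat_counters(read_nbtstat()))` then gives a yes/no answer to the question above.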

Always nice to have a good challenge ;)

Re: ProCurve 2848 - All users lose connection

Hi Kell,

Okay, I have set up a PC to sniff a workstation that could have the problem. The sniffing PC is not doing any other tasks. I have not set it to monitor the server as of yet.

Last night I also updated the drivers to the newest version on both network cards in the server.
I also had a look at the teaming. Its team settings were set to:
Team Type Selection: Automatic (Recommended)
Transmit Load Balancing Method: Destination MAC address

I changed the Transmit Load Balancing Method to: Automatic (Recommended)

Do you think this is the best config? So far this morning we have not had an instance of users losing connection. But if/when it happens again, I think I will look at changing these 2 settings from automatic to a manual config.

I hadn't thought of using Performance Monitor on the server; I'll do that.

Nothing of any relevance in the event log.

The FSMO roles are split over a couple of DCs, and we are running in native mode, all servers are 2003 standard.

And it seems we are indeed using NetBIOS: I tried nbtstat -r and had 4 items resolved by broadcast. So, does this need to be disabled on the clients? Can it be done through Group Policy? It makes sense not to use it, to reduce the number of broadcasts through the switches.

I'll keep you posted on what the sniffer throws up.

Thanks a load