Hello,
I've got a cluster that is causing me some grief since the customer's DBAs/application developers changed their whole design.
The daft design flaw, in my view, is that although the cluster consists of 3 nodes, each of them running packages/DB instances, all client requests have to make TCP connections to the "central" node/instance.
This is causing considerable network traffic on this node, with the effect that most of the TCP DB connections end up being blocked.
The central server is an N-Class which by now has reached its final stage of HW upgradability.
Please see my attachment for details; it contains the data I collected.
I'm using the MWA (aka OpenView Performance Agent) toolkit for monitoring services.
I mostly left the preset configuration (as in /opt/perf/newconfig/alarmdef and /opt/perf/newconfig/parm) unchanged, and only made minor modifications to the respective files in /var/opt/perf (see attachment).
With these alarmdefs I get several network bottleneck alerts during the day (see utility sample output from yesterday in attachment).
The bottlenecks reported during the night (from 20:00 onwards) may be neglected here, as they are due to backup traffic that wouldn't directly affect users, unlike the bottlenecks during working hours.
My problem now is how to verify that the maximum bandwidth of the NIC/LAN really has been reached.
To this end I would rather go for the BYNETIF_{IN|OUT}_BYTE_RATE metrics than the BYNETIF_{IN|OUT}_PACKET_RATE metrics, because I believe they would make it more conspicuous that the bandwidth limit has been reached, simply because of the bytes/sec unit.
But I couldn't extract the BYTE_RATEs, nor did I see a way in the PerfView tool to get them charted.
(Btw, I feel that I don't need to monitor the ERROR_RATE or COLLISION_RATE, since the cumulative MIB stats of the NIC suggest that everything is in order in this respect; see attachment.)
Because of my poor Ethernet/ARP/IP knowledge I have to ask you network experts how to translate from PACKET_RATEs (in Hz) to BYTE_RATEs when I don't know how large the average packet is.
So my rather naive (worst-case) assumption would be the following product (strictly a bit rate, since the byte count is multiplied by 8):
BIT_RATE = PACKET_RATE * MTU * 8
Taking the observed peak PACKET_RATEs (which are about 7000 Hz) and the default MTU of 1500 bytes the NIC is set to (see attachment), the above naive formula already yields some 84 Mbit/s.
The NICs are quad-port 10/100 Mbit/s Base-TX cards which, per autonegotiation with the switch link partner, operate at 100 Mbit/s full duplex (see attachment).
Thus a network bottleneck would sound reasonable.
But on the other hand the packets could have an average size as small as 64 octets, in which case the theoretical maximum bandwidth would never be reached, while the sheer packet handling through the stack layers may already have brought the btlan driver to its knees.
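To make the two extremes concrete, here is a back-of-the-envelope sketch (plain arithmetic in Python; the 7000 packets/s peak and the 1500-byte MTU are from my data above, the 64-byte minimum frame is just the Ethernet lower bound):

# Back-of-the-envelope bounds for the link load implied by a given packet rate.
PACKET_RATE = 7000      # observed peak, packets/s (from the MWA data)
MTU         = 1500      # bytes, default Ethernet MTU of the NIC
MIN_FRAME   = 64        # bytes, minimum Ethernet frame size
LINK_SPEED  = 100e6     # bits/s, i.e. 100 Mbit/s full duplex

def bit_rate(packet_rate, avg_packet_bytes):
    # packets/s times bytes/packet times 8 bits/byte gives bits/s
    return packet_rate * avg_packet_bytes * 8

for label, size in (("MTU-sized packets", MTU), ("64-byte packets", MIN_FRAME)):
    rate = bit_rate(PACKET_RATE, size)
    print("%-20s %6.1f Mbit/s (%3.0f%% of the link)"
          % (label, rate / 1e6, 100.0 * rate / LINK_SPEED))

So depending on the true average packet size, anywhere between roughly 4% and 84% of the 100 Mbit/s link could actually be in use, which is exactly why I need the real average packet size.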
Is packet sniffing and extracting the packets' frame sizes the only safe way to find out the average packet size?
How could I use nettl and netfmt to this end? (I have never used these HP-UX tools, only open source sniffers that use the libpcap API.)
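Just to illustrate what I'm after: with a libpcap-based capture I could of course compute the average frame size myself, roughly along these lines (a sketch only, assuming a classic uncompressed pcap file as written by tcpdump -w; the file name is just a placeholder):

#!/usr/bin/env python
# Rough sketch: average on-the-wire frame size from a classic libpcap capture.
# Uses orig_len, so a short snaplen during capture doesn't skew the average.
import struct
import sys

PCAP_FILE = sys.argv[1] if len(sys.argv) > 1 else "trace.pcap"  # placeholder

with open(PCAP_FILE, "rb") as f:
    global_hdr = f.read(24)                       # fixed-size pcap global header
    magic = struct.unpack("<I", global_hdr[:4])[0]
    endian = "<" if magic == 0xa1b2c3d4 else ">"  # pick the file's byte order

    frames = 0
    wire_bytes = 0
    while True:
        rec_hdr = f.read(16)                      # per-packet record header
        if len(rec_hdr) < 16:
            break
        ts_sec, ts_usec, incl_len, orig_len = struct.unpack(endian + "IIII", rec_hdr)
        frames += 1
        wire_bytes += orig_len                    # length as seen on the wire
        f.seek(incl_len, 1)                       # skip the captured payload

if frames:
    print("%d frames, average size %.1f bytes" % (frames, wire_bytes / float(frames)))
else:
    print("no frames found in %s" % PCAP_FILE)

That would give me the average frame size over the capture window, which together with the PACKET_RATE would pin down the real byte rate; I just don't know whether nettl/netfmt can give me the same information more directly.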
If I really confirm a LAN bottleneck here, what would my options be to overcome it?
Mind you, I have no influence on the application.
My first remedy, of course, would be to reduce the number of network connections altogether by changing the application's logic.
Would it make sense to upgrade to Gbit LAN?
I guess this would imply an upgrade of other components such as switches, routers etc., and thus be quite costly.
(It would also require the willingness of the network admins.)
Since the servers have quad NICs, most of whose ports are unused (of course one is on standby for HA failover), are there ways to distribute the load across several NICs?
I think in this context I've heard the buzzword Auto Port Aggregation.
How costly a solution would this be with regard to an SG cluster?
Regards
Ralph
Madness, thy name is system administration