Hello,
I've got a cluster that is causing me some grief since the customer's DBAs/application developers changed their whole design.
The daft design flaw, in my view, is that although the cluster consists of 3 nodes, each of them running packages/DB instances, all client requests have to make TCP connections to the "central" node/instance.
This is causing considerable network traffic on this node, with the effect that most of the TCP DB connections end up being blocked.
The central server is an N-Class which by now has reached its final stage of HW upgradability.
Please see my attachment for details; it contains the data I collected.
I'm using the MWA (aka OpenView Performance Agent) toolkit for monitoring services.
I mostly left the preset configuration (as in /opt/perf/newconfig/alarmdef and /opt/perf/newconfig/parm) unchanged, and only made minor modifications to the respective files in /var/opt/perf (see attachment).
With these alarmdefs I get several network bottleneck alerts during the day (see utility sample output from yesterday in attachment).
The bottlenecks reported during the night (from 20:00 onwards) may be neglected here, as they are due to backup traffic that wouldn't directly affect users, unlike the bottlenecks during working hours.
My problem now is how to verify that the maximum bandwidth of the NIC/LAN really has been reached.
To this end I would rather go for the BYNETIF_{IN|OUT}_BYTE_RATE metrics than the BYNETIF_{IN|OUT}_PACKET_RATE metrics, because I believe they would make it more conspicuous that the bandwidth limit has been reached, simply because of the bytes/sec unit.
But I couldn't extract the BYTE_RATEs, nor did I see a way in the PerfView tool to get them charted.
(Btw, I feel that I don't need to monitor the ERROR_RATE or COLLISION_RATE, since the cumulative MIB stats of the NIC suggest that everything is in order in this respect; see attachment.)
Because of my poor Ethernet/ARP/IP knowledge I have to ask you network experts how to translate from PACKET_RATEs (in Hz) to BYTE_RATEs when I don't know how large the average packet is.
So my rather naive (worst-case) assumption would be the following product (strictly a bit rate, since the byte count is multiplied by 8):
BIT_RATE = PACKET_RATE * MTU * 8
Taking the observed peak PACKET_RATEs (which are about 7000 Hz) and the default MTU of 1500 bytes the NIC is set to (see attachment), the above naive formula already yields some 84 Mbit/s.
The NICs are quad-port 10/100 Mbit/s Base-TX cards which, per autonegotiation with the switch link partner, operate at 100 Mbit/s full duplex (see attachment).
Thus a network bottleneck would sound reasonable.
But on the other hand the packets could have an average size as small as 64 octets, in which case the theoretical maximum bandwidth would never be reached, while the sheer packet handling through the stack layers may already have brought the btlan driver to its knees.
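To make the two extremes concrete, here is a back-of-the-envelope sketch (plain arithmetic in Python; the 7000 packets/s peak and the 1500-byte MTU are from my data above, the 64-byte minimum frame is just the Ethernet lower bound):

# Back-of-the-envelope bounds for the link load implied by a given packet rate.
PACKET_RATE = 7000      # observed peak, packets/s (from the MWA data)
MTU         = 1500      # bytes, default Ethernet MTU of the NIC
MIN_FRAME   = 64        # bytes, minimum Ethernet frame size
LINK_SPEED  = 100e6     # bits/s, i.e. 100 Mbit/s full duplex

def bit_rate(packet_rate, avg_packet_bytes):
    # packets/s times bytes/packet times 8 bits/byte gives bits/s
    return packet_rate * avg_packet_bytes * 8

for label, size in (("MTU-sized packets", MTU), ("64-byte packets", MIN_FRAME)):
    rate = bit_rate(PACKET_RATE, size)
    print("%-20s %6.1f Mbit/s (%3.0f%% of the link)"
          % (label, rate / 1e6, 100.0 * rate / LINK_SPEED))

So depending on the true average packet size, anywhere between roughly 4% and 84% of the 100 Mbit/s link could actually be in use, which is exactly why I need the real average packet size.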
Is packet sniffing and extracting the packets' frame sizes the only safe way to find out the average packet size?
How could I use nettl and netfmt to this end? (I have never used these HP-UX tools, only open source sniffers that use the libpcap API.)
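Just to illustrate what I'm after: with a libpcap-based capture I could of course compute the average frame size myself, roughly along these lines (a sketch only, assuming a classic uncompressed pcap file as written by tcpdump -w; the file name is just a placeholder):

#!/usr/bin/env python
# Rough sketch: average on-the-wire frame size from a classic libpcap capture.
# Uses orig_len, so a short snaplen during capture doesn't skew the average.
import struct
import sys

PCAP_FILE = sys.argv[1] if len(sys.argv) > 1 else "trace.pcap"  # placeholder

with open(PCAP_FILE, "rb") as f:
    global_hdr = f.read(24)                       # fixed-size pcap global header
    magic = struct.unpack("<I", global_hdr[:4])[0]
    endian = "<" if magic == 0xa1b2c3d4 else ">"  # pick the file's byte order

    frames = 0
    wire_bytes = 0
    while True:
        rec_hdr = f.read(16)                      # per-packet record header
        if len(rec_hdr) < 16:
            break
        ts_sec, ts_usec, incl_len, orig_len = struct.unpack(endian + "IIII", rec_hdr)
        frames += 1
        wire_bytes += orig_len                    # length as seen on the wire
        f.seek(incl_len, 1)                       # skip the captured payload

if frames:
    print("%d frames, average size %.1f bytes" % (frames, wire_bytes / float(frames)))
else:
    print("no frames found in %s" % PCAP_FILE)

That would give me the average frame size over the capture window, which together with the PACKET_RATE would pin down the real byte rate; I just don't know whether nettl/netfmt can give me the same information more directly.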
If I really confirm a LAN bottleneck here, what would my options be to overcome it?
Mind you, I have no influence on the application.
My first remedy, of course, would be to reduce the number of network connections altogether by changing the application's logic.
Would it make sense to upgrade to Gbit LAN?
I guess this would imply an upgrade of other components such as switches, routers etc., and thus be quite costly.
(It would also require the willingness of the network admins.)
Since the servers have quad NICs, most of whose ports are unused (of course one is on standby for HA failover), are there ways to distribute the load across several NICs?
I think in this context I've heard the buzzword Auto Port Aggregation.
How costly a solution would this be with regard to an SG cluster?
Regards
Ralph
Madness, thy name is system administration