
management agent performance issues on DL580 G2/RHEL 3

 
Matthew Lee_2
New Member

management agent performance issues on DL580 G2/RHEL 3

On a recently installed DL580 G2 running RHEL 3 we began noticing frequent pauses while typing in a terminal window, i.e. the display wasn't keeping up with the keyboard. The problem could be clearly demonstrated by holding down a key ... every 15 seconds or so, the stream of characters would slow to a crawl, then resume at full tilt again. This happened both at the console and in an ssh window.

Well, sure enough, the cmaidad process polls on a 15-second cycle and was always at the top of the run queue when we lost keyboard response. Running strace on cmaidad showed several ioctl calls similar to the following at every slowdown:

open("/dev/cciss/c0d0", O_RDONLY) = 3
ioctl(3, CCISS_PASSTHRU, 0xbfffddd4) = 0
close(3) = 0
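
For anyone wanting to reproduce this, something along these lines shows the pattern (pidof -s is just one way of finding the PID):

strace -tt -e trace=open,ioctl,close -p `pidof -s cmaidad`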

If we disable cmaidad, the problem goes away.

Our DL580 has 2 x 2.5 GHz Xeons, 8 GB of RAM and a Smart Array 5i+ running RAID 5.
The Red Hat install is up to date patch-wise, with kernel rev 2.4.21-15.ELsmp.
For hpasm/cmastor/cmanic we've tried versions 7.0.0-23/7.0.0-16/7.0.0-4 and 7.1.0-145/7.1.0-12/7.1.0-5 with the same result.

Interestingly, a DL380G3 with a 5i controller and an *identical* OS/software installation does not demonstrate the same problem.

Also, the affected DL580G2 previously had RHEL 2.1 with recent versions of the agents installed without exhibiting this issue.

The keyboard response issue is merely annoying, but we're also concerned about other performance impacts this may be causing.

Any thoughts?

Thanks
Matthew Lee
Fleming College
Vernon Brown_4
Trusted Contributor

Re: management agent performance issues on DL580 G2/RHEL 3

Sounds like the culprit process may be waiting for I/O. You may be able to fix it by changing the nice number for the process itself or for the device it is waiting for.
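
For example (untested here, and assuming cmaidad is the offending process), something like:

renice 19 -p `pidof -s cmaidad`

would drop it to the lowest CPU scheduling priority.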

Matthew Lee_2
New Member

Re: management agent performance issues on DL580 G2/RHEL 3

I think I've resolved it. I upgraded the system firmware from 2003.11.18 to 2004.05.02 and the problem went away.

Thanks
Matthew
Matthew Lee_2
New Member

Re: management agent performance issues on DL580 G2/RHEL 3

Looks like I spoke too soon. We were sure the problem had gone away for 2-3 days, but we came in Monday morning and it was back.

I've had to turn off the agents, otherwise Oracle (the only app on the box) slows to a crawl during periods of medium-heavy activity.
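
For anyone in the same boat, we stopped and disabled them with roughly the following (the init script name is assumed from the hpasm package):

/etc/init.d/hpasm stop
chkconfig hpasm off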

As for nicing processes, per Vernon's suggestion, it has negligible impact, and certainly shouldn't be necessary.

Anyone running the agents on a DL580G2 with RHEL3 and *not* having problems?

Thanks
Matthew
Michael Williams_6
Trusted Contributor

Re: management agent performance issues on DL580 G2/RHEL 3

Hi there, we've got five DL580 G2s and they all exhibit this problem. We have about 15 DL380 G3s and none of them exhibit it. We have a similar set-up to yours, only we're running UnitedLinux instead of RH.

This makes me think there's something wrong with the management agents as it's affecting only one class of server.

Just to check though, we're running a 5i for system and 6400 controller for data on the 580's, and only a 5i and/or 5300 on the 380's. Are you running the same?

Do you have a support contract? Have you logged a call or got it fixed yet? If not, I might log a call, as the amount of time this process has clocked up on these servers is quite worrying...
Matthew Lee_2
New Member

Re: management agent performance issues on DL580 G2/RHEL 3

Michael,

We're using the 5i for the OS and also have an external FC array for data connected via an Emulex HBA, but no 6400. We had a similar setup on our 380.

We're no further along in resolving the issue. We don't have a support contract beyond what the warranty provides, but like you, we're at the point where we'll need to open a call with HP to get this resolved.

Thanks
Matthew
Joe Rybacek
New Member

Re: management agent performance issues on DL580 G2/RHEL 3

We have a similar issue with cmaidad. ML570 G2 / RHEL 3 (2.4.21-15.0.3.ELsmp) / hpasm (7.1.0-145) / cmastor (7.1.0-12). I don't think it impacts the general performance of the server; however, I'm curious as to what it does. What is IDA? It must be similar to FCA (Fibre Channel).
Edmund White
Frequent Advisor

Re: management agent performance issues on DL580 G2/RHEL 3

Was this ever resolved? I'm experiencing this on a DL740 with the 7.1 agents.
Michael Williams_6
Trusted Contributor

Re: management agent performance issues on DL580 G2/RHEL 3

Blimey this was some time ago!

We do have a support contract and I logged a call with HP themselves. My problem was more that the cmaidad process was utilising way too much processor time, but I guess the side effect could have been as you stated; we don't really use the console here, so I couldn't tell you!

Anyway, this problem is apparently now fixed in PSP 7.2, which you'll note isn't out yet. HP are trying to do an interim release of just the agents, but those don't appear to have been published yet.

My last update shows that the PSP release is due at the end of the year... I guess we'll have to wait!
Edmund White
Frequent Advisor

Re: management agent performance issues on DL580 G2/RHEL 3

I fixed this by using the 6.40 agents, which seem more stable anyway. I used a modified version of Red Hat 8 with a custom kernel, so I had that option. The 7.1 drivers work well on some of the lower-end servers. Odd. Is there a direct contact to HP's Linux development/solutions group?
Mike Jagdis
Advisor

Re: management agent performance issues on DL580 G2/RHEL 3

The 6th thread :-). Pasting the same reply :-).

Actually you're half way there. You just lack final kernel enlightenment :-)

-----------------------------------------------
The explanation is somewhat involved...

If you trace cma*d you'll find that it doesn't do anything but open the device, ioctl, close. Admittedly rather more times than should be necessary but that's just incidental bad design.

You'll find the delay - and system time consumption - seems to happen on the close. From here you need a fairly good working knowledge of the Linux kernel...

OK, still with me then?

Run oprofile for a while and you'll find the CPU time is being consumed by invalidate_bdev - which is interesting :-).
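
Roughly, a session looks like this (tool names and options vary between OProfile versions, and the vmlinux path is only an example):

opcontrol --vmlinux=/boot/vmlinux
opcontrol --start
sleep 60
opcontrol --shutdown
opreport --symbols | head -20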

invalidate_bdev is called from kill_bdev, and kill_bdev is called from the block device release code. Release is what happens on the last close. Now, the monitoring daemon is opening the unpartitioned disk device, which it is pretty certain nothing else has open. (Offhand I'm not sure if even having an fs on the device counts as it being open. There are subtle differences and I *think* I'm right in saying that block device access and fs access are considered different at this level. Don't quote me or blame me!)

So, each close triggers invalidate_bdev. Why is this so bad? Well, the idea is that when the last close happens on a device you need to flush any cached data because, with much PC hardware, you can't be sure when the media gets changed. invalidate_bdev isn't *meant* to be called often. It works by scanning through the entire list of cached block device data to find and drop data related to the device being closed. So it sucks system time, and the amount is proportional to how much cached data (from any device) you have.
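
You can see the effect from userspace without the agents involved at all; a rough illustration (the file and device names are just examples, run as root):

dd if=/some/large/file of=/dev/null bs=1M
time dd if=/dev/cciss/c0d0 of=/dev/null bs=512 count=1

The first dd fills the cache; the close at the end of the second is where the time goes, and it grows with the amount of cached data.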

WORKAROUND:
All you need to do is to make sure that each time the cma*d daemon closes the device it isn't the *last* close - i.e. some other process has the device open. The other process doesn't even need to *do* anything. Try something along the lines of:

sh -c 'kill -STOP $$' < /dev/cciss/c0d0 > /dev/null 2>&1 &

Hope that's all clear! (As mud... :-) )
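
To confirm the holder is still there, or to get rid of it again, something like this should do (the PID is of course whatever ps/lsof reports):

lsof /dev/cciss/c0d0
kill -CONT <pid-of-the-stopped-sh>

Once continued, the shell simply exits and releases the device.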

(HP: As well as blind debugging I do Linux & OSS consultancy. I happen to know the answer to this one as it came up at a major investment bank...)