2-cpu Node unresponsive under heavy load

Lachele Foley — Mon, 24 Jul 2006 15:02:26 GMT

We have a cluster of DL360 G3's, each with two cpus. We're running RHEL (2.4.21-27.ELsmp #1 SMP).

Sometimes, when users run Jaguar (a quantum mechanics package), one or more of their nodes become unresponsive during portions of the run. These runs typically go for several hours to several weeks and the node may be unresponsive for hours or days during that period (not that I check constantly, mind you).

Unresponsive means: usually responds to ping, will not allow ssh, will not allow control by the cluster management software (Scali), usually shows as "node alive" in the cluster management software, often will not allow local login (with keyboard, monitor & mouse attached to the node itself) -- the login times out while the password is being checked.
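The symptoms above (ping answers, ssh does not) can be logged automatically instead of being noticed by hand. The following is only a minimal sketch, assuming Python 3 is available on the machine doing the checking (it would not be the stock interpreter on RHEL 3). The function name `probe_ssh` and its return strings are made up for illustration. It separates "the TCP connection is refused or times out" from "the connection is accepted but sshd never sends its banner", which is the pattern one would expect from a node whose kernel still answers but whose user-space processes are starved.

```python
import socket


def probe_ssh(host, port=22, timeout=5.0):
    """Classify how a node's SSH port responds (hypothetical helper).

    Returns one of:
      "ok"              - connected and received an "SSH-" banner
      "no_banner"       - connected, but no SSH banner arrived in time
      "connect_timeout" - the TCP connection attempt timed out
      "connect_failed"  - the connection was refused or otherwise failed
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return "connect_timeout"
    except OSError:
        return "connect_failed"

    try:
        sock.settimeout(timeout)
        banner = sock.recv(64)  # sshd normally sends its banner first
    except socket.timeout:
        return "no_banner"
    except OSError:
        return "connect_failed"
    finally:
        sock.close()

    if banner.startswith(b"SSH-"):
        return "ok"
    # Empty or unexpected reply: treat it like a missing banner.
    return "no_banner"


if __name__ == "__main__":
    # "node01" is a placeholder host name, not one from this thread.
    print(probe_ssh("node01", timeout=5.0))
```

Run from the head node on a timer, the results would give a rough record of when each node stops answering and for how long.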

The user whose job is running, on the other hand, says that everything is just fine.

Schrodinger (who makes Jaguar) says the trouble is that the two cpu's are battling for access to the one hard drive. They say we should run multi-cpu jobs using one cpu per node. We don't much like the thought of doing that.

Does anyone here have experience with this? Is there a way to set access parameters for the HD? I'm thinking of something like the control one has over I/O with an nfs mount. Does that exist? Does anyone have any other solutions?

Thanks!

:-) Lachele

Re: 2-cpu Node unresponsive under heavy load

Vitaly Karasik_1 — Tue, 25 Jul 2006 05:29:34 GMT

Multi-CPU machines have been common for many years, and Linux works well with SMP.

Do you see something interesting in /var/log/messages?

I'd suggest upgrading to the latest available kernel (and the other RHEL updates as well).

Re: 2-cpu Node unresponsive under heavy load

Lachele Foley — Tue, 25 Jul 2006 10:26:52 GMT

"Multi-CPU machines have been common for many years, and Linux works well with SMP."

I agree. This is the only program I've seen do this. Many other users run other programs, even other QM packages, without this issue.

"Do you see something interesting in /var/log/messages?"

Nope. No unusual entries at all.

"I'd suggest upgrading to the latest available kernel (and the other RHEL updates as well)."

We do a lot of different things here and run a lot of different programs. Upgrading to fix one issue can mean breaking five others. So, we only make major changes when absolutely necessary. This problem doesn't fall into "absolutely necessary." Besides, I'd want to know that upgrading would actually fix the problem. Do you think it would?

Re: 2-cpu Node unresponsive under heavy load

George Liu_4 — Tue, 25 Jul 2006 12:41:28 GMT

This is more likely an application software issue than a kernel issue. Just out of curiosity, what does "sar -d" show for the problematic period?

Re: 2-cpu Node unresponsive under heavy load

Steven E. Protter — Tue, 25 Jul 2006 15:15:00 GMT

Shalom,

There are known bugs in RH clustering and the kernel in releases up to and including RH 3 update 6. Though you are not using all of the components, you may have a kernel issue.

You may find that the latest RH 3 update 8 kernel helps or, if the applications support it, that an upgrade to the 2.6 kernel solves this issue.

SEP

Re: 2-cpu Node unresponsive under heavy load

George Liu_4 — Tue, 25 Jul 2006 15:40:22 GMT

There are too many issues with RHEL3 Update 8. Please hold on for this update.

Re: 2-cpu Node unresponsive under heavy load

Lachele Foley — Wed, 26 Jul 2006 12:59:33 GMT

Thanks for all the responses!

Last night, after negotiations with the user, I started a duplicate of one of his jobs (as that user, so same conditions) except that the 8 cpu's were on different nodes. So far, all the nodes remain responsive. [I'd done something like this before, but it was a hectic time, so wanted to re-test.]

"sar" doesn't exist on the compute nodes (though the headnode has it). I don't know if this is by design or accident, but agree that output from sar would help.
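Where sar is not installed, a rough substitute for part of what it reports can be read straight from /proc/stat. The sketch below is an illustration under assumptions, not a replacement for "sar -d": it assumes Python 3 is available on the node, and it assumes the kernel exposes an iowait column as the fifth value on the aggregate "cpu" line (if only four values are present, iowait is simply left out). The helper names `parse_cpu_line` and `cpu_shares` are made up for this example. A large iowait share during the unresponsive periods would be consistent with the "two cpus battling for one disk" explanation.

```python
import time

# Names for the leading columns of the aggregate "cpu" line in /proc/stat.
FIELDS = ["user", "nice", "system", "idle", "iowait"]


def parse_cpu_line(stat_text):
    """Return the integer counters from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(x) for x in parts[1:]]
    raise ValueError("no aggregate cpu line found")


def cpu_shares(before, after):
    """Percentage of elapsed CPU time per field between two snapshots."""
    a = parse_cpu_line(before)
    b = parse_cpu_line(after)
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    if total <= 0:
        return {}
    # zip() stops at the shorter list, so a 4-column kernel omits iowait.
    return {name: 100.0 * d / total for name, d in zip(FIELDS, deltas)}


def read_stat(path="/proc/stat"):
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    first = read_stat()
    time.sleep(5)  # sampling interval in seconds, chosen arbitrarily
    second = read_stat()
    for name, pct in cpu_shares(first, second).items():
        print("%-7s %5.1f%%" % (name, pct))
```

Because the script has to be started before the node stops accepting logins, it would need to run in a loop from the beginning of the job and write to a file that can be read afterwards.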

Like I said, this isn't an earth-shattering issue -- it just keeps me from taking cpu temperatures, etc., as often as I want to. So, I can wait to upgrade.

This page:

http://www.redhat.com/security/updates/notes/

...doesn't list an update 8 for RHEL 3. Is that the one to wait for?

Again, thanks to all!

Re: 2-cpu Node unresponsive under heavy load

Vitaly Karasik_1 — Thu, 27 Jul 2006 02:19:16 GMT

RHEL3 Update 8 is available: https://www.redhat.com/archives/taroon-list/2006-July/msg00046.html

And if Jaguar is just an application, without binary kernel modules, you can ask RH support for help.