Lachele Foley
Occasional Contributor

2-cpu Node unresponsive under heavy load

We have a cluster of DL360 G3's, each with two cpus. We're running RHEL (2.4.21-27.ELsmp #1 SMP).

Sometimes, when users run Jaguar (a quantum mechanics package), one or more of their nodes become unresponsive during portions of the run. These runs typically go for several hours to several weeks and the node may be unresponsive for hours or days during that period (not that I check constantly, mind you).

Unresponsive means: usually responds to ping, will not allow ssh, will not allow control by the cluster management software (Scali), usually shows as "node alive" in the cluster management software, often will not allow local login (with keyboard, monitor & mouse attached to the node itself) -- the login times out while the password is being checked.

The user whose job is running, on the other hand, says that everything is just fine.

Schrodinger (who makes Jaguar) says the trouble is that the two cpu's are battling for access to the one hard drive. They say we should run multi-cpu jobs using one cpu per node. We don't much like the thought of doing that.

Does anyone here have experience with this? Is there a way to set access parameters for the HD? I'm thinking of something like the control one has over I/O with an nfs mount. Does that exist? Does anyone have any other solutions?

Thanks!

:-) Lachele
Vitaly Karasik_1
Honored Contributor

Re: 2-cpu Node unresponsive under heavy load

Multi-CPU machines have been common for many years, and Linux handles SMP pretty well.

Do you see anything interesting in /var/log/messages?

I'd suggest upgrading to the latest available kernel (and the other RHEL updates as well).
Lachele Foley
Occasional Contributor

Re: 2-cpu Node unresponsive under heavy load

"Multi-CPU machines have been common for many years, and Linux handles SMP pretty well."

I agree. This is the only program I've seen do this. Many other users run other programs, even other QM packages, without this issue.

"Do you see anything interesting in /var/log/messages?"

Nope. No unusual entries at all.

"I'd suggest upgrading to the latest available kernel (and the other RHEL updates as well)."

We do a lot of different things here and run a lot of different programs. Upgrading to fix one issue can mean breaking five others. So, we only make major changes when absolutely necessary. This problem doesn't fall into "absolutely necessary." Besides, I'd want to know that upgrading would actually fix the problem. Do you think it would?
George Liu_4
Trusted Contributor

Re: 2-cpu Node unresponsive under heavy load

This is more likely an application issue than a kernel issue. Just out of curiosity, what does "sar -d" show for the problematic period?
Steven E. Protter
Exalted Contributor

Re: 2-cpu Node unresponsive under heavy load

Shalom,

There are known bugs with RH clustering and the kernel in releases up to and including RH 3 update 6. Though you are not using all of the components, you may have a kernel issue.

You may find that the latest RH 3 update 8 kernel helps or, if the applications support it, that an upgrade to the 2.6 kernel solves this issue.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
George Liu_4
Trusted Contributor

Re: 2-cpu Node unresponsive under heavy load

There are too many issues with RHEL3 Update 8. Please hold off on this update.
Lachele Foley
Occasional Contributor

Re: 2-cpu Node unresponsive under heavy load

Thanks for all the responses!

Last night, after negotiations with the user, I started a duplicate of one of his jobs (as that user, so same conditions) except that the 8 cpu's were on different nodes. So far, all the nodes remain responsive. [I'd done something like this before, but it was a hectic time, so I wanted to re-test.]

"sar" doesn't exist on the compute nodes (though the headnode has it). I don't know if this is by design or accident, but I agree that output from sar would help.
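
In the meantime, something like the following might serve as a rough stand-in for sar on a node that lacks it. It is only a sketch: it assumes a reasonably recent Python is available on the node (a stock RHEL 3 install may not have one), and the function names are made up for illustration. It reads the aggregate "cpu" line from /proc/stat twice and reports what share of the elapsed time went to each counter. The first four counters are user, nice, system and idle; newer kernels append more (iowait and so on), and older ones may report only those four. It says nothing about per-disk activity, so it does not replace "sar -d".

```python
def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        # The aggregate line is labelled exactly "cpu"; per-CPU lines are
        # "cpu0", "cpu1", ... and are skipped here.
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def cpu_shares(before, after):
    """Fraction of elapsed ticks spent in each counter between two samples."""
    n = min(len(before), len(after))
    deltas = [after[i] - before[i] for i in range(n)]
    total = sum(deltas)
    if total <= 0:
        # No time elapsed between samples (or counters wrapped): report zeros.
        return [0.0] * n
    return [d / float(total) for d in deltas]


def sample(path="/proc/stat"):
    """Take one sample of the aggregate CPU counters from the given file."""
    with open(path) as f:
        return parse_cpu_line(f.read())
```

To use it, call sample() twice a few seconds apart and pass the two results to cpu_shares(). A node whose time is mostly "system" (or "iowait", where the kernel reports it) during the unresponsive stretch would fit the disk-contention explanation better than one that is mostly "user".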

Like I said, this isn't an earth-shattering issue -- it just keeps me from taking cpu temperatures, etc., as often as I want to. So, I can wait to upgrade.

This page:

http://www.redhat.com/security/updates/notes/

..doesn't list an update 8 for RHEL 3. Is that the one to wait for?

Again, thanks to all!
Vitaly Karasik_1
Honored Contributor

Re: 2-cpu Node unresponsive under heavy load

RHEL3 Update 8 is available: https://www.redhat.com/archives/taroon-list/2006-July/msg00046.html

And if Jaguar is just an application, without binary kernel modules, you can ask RH support for help.