07-24-2006 08:02 AM
2-cpu Node unresponsive under heavy load
Sometimes, when users run Jaguar (a quantum mechanics package), one or more of their nodes become unresponsive during portions of the run. These runs typically go for several hours to several weeks and the node may be unresponsive for hours or days during that period (not that I check constantly, mind you).
Unresponsive means: usually responds to ping, will not allow ssh, will not allow control by the cluster management software (Scali), usually shows as "node alive" in the cluster management software, often will not allow local login (with keyboard, monitor & mouse attached to the node itself) -- the login times out while the password is being checked.
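For anyone who wants to log this state rather than check by hand, here is a minimal sketch of a probe that could run from the headnode. It assumes a Python 3 interpreter and the Linux iputils `ping` flags (`-c`, `-W`); the function names and the labels "down", "starved" and "ok" are made up for illustration. It reads the SSH identification line instead of only opening the port, because a starved node may still accept the TCP connection while sshd never gets to answer.

```python
import socket
import subprocess


def ping_ok(host, timeout_s=2):
    """Return True if one ICMP echo gets a reply (shells out to the system ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def ssh_banner(host, port=22, timeout_s=5.0):
    """Return the SSH identification line, or None if none arrives in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.settimeout(timeout_s)
            data = sock.recv(256)
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return None
    line = data.decode("ascii", errors="replace").strip()
    return line if line.startswith("SSH-") else None


def classify(ping_replied, banner):
    """Map the two probe results onto a coarse, hypothetical node state."""
    if not ping_replied:
        return "down"
    if banner is None:
        # Answers ping but sshd is silent: the symptom described above.
        return "starved"
    return "ok"


def probe(host):
    """Run both probes against one node and return its state label."""
    return classify(ping_ok(host), ssh_banner(host))
```

Calling `probe("node07")` (a hypothetical hostname) from cron every few minutes and logging the result with a timestamp would show how long each unresponsive stretch really lasts.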
The user whose job is running, on the other hand, says that everything is just fine.
Schrodinger (who makes Jaguar) says the trouble is that the two CPUs are battling for access to the one hard drive. They say we should run multi-CPU jobs using one CPU per node. We don't much like the thought of doing that.
Does anyone here have experience with this? Is there a way to set access parameters for the hard drive? I'm thinking of something like the control one has over I/O with an NFS mount. Does that exist? Does anyone have any other solutions?
Thanks!
:-) Lachele
07-24-2006 10:29 PM
Do you see something interesting in /var/log/messages?
I'd suggest you upgrade to the latest available kernel (and other RHEL updates as well).
07-25-2006 03:26 AM
I agree. This is the only program I've seen do this. Many other users run other programs, even other QM packages, without this issue.
"Do you see something interesting in /var/log/messages?"
Nope. No unusual entries at all.
"I'd suggest you upgrade to the latest available kernel (and other RHEL updates as well)"
We do a lot of different things here and run a lot of different programs. Upgrading to fix one issue can mean breaking five others. So, we only make major changes when absolutely necessary. This problem doesn't fall into "absolutely necessary." Besides, I'd want to know that upgrading would actually fix the problem. Do you think it would?
07-25-2006 05:41 AM
07-25-2006 08:15 AM
There are known bugs in RH clustering and in the kernel, in releases up to and including RH 3 update 6. Even though you are not using all of the components, you may well have a kernel issue.
You may find that the latest RH 3 update 8 kernel helps or, if your applications support it, that an upgrade to the 2.6 kernel solves this issue.
SEP
Owner of ISN Corporation
07-25-2006 08:40 AM
07-26-2006 05:59 AM
Last night, after negotiations with the user, I started a duplicate of one of his jobs (as that user, so same conditions) except that the 8 CPUs were on different nodes. So far, all the nodes remain responsive. [I'd done something like this before, but it was a hectic time, so I wanted to re-test.]
"sar" doesn't exist on the compute nodes (though the headnode has it). I don't know if this is by design or accident, but agree that output from sar would help.
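Where sar is missing, a rough stand-in for its CPU summary can be built from the kernel's own counters. The sketch below assumes Python 3 is available on the node, which may not be true of an older install, and that /proc/stat lists the per-state counters in the order user, nice, system, idle, iowait, irq, softirq. Kernels that expose only the first four simply produce no iowait figure. The function names are made up for illustration, and counters beyond the seventh are ignored, so the percentages are relative to these states only.

```python
import time

# Order of the counters on the aggregate "cpu" line (assumed, see above).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # zip stops at the shorter sequence, so a four-field line yields four keys.
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    deltas = {k: after[k] - before[k] for k in after if k in before}
    total = sum(deltas.values())
    if total <= 0:
        return {k: 0.0 for k in deltas}
    return {k: 100.0 * v / total for k, v in deltas.items()}


def read_cpu_sample(path="/proc/stat"):
    """Read the aggregate cpu counters from the given stat file."""
    with open(path) as f:
        for line in f:
            if line.startswith("cpu "):
                return parse_cpu_line(line)
    raise RuntimeError("no aggregate cpu line found")


def report(interval_s=5):
    """Print one sar-like summary line per state over the interval."""
    first = read_cpu_sample()
    time.sleep(interval_s)
    second = read_cpu_sample()
    for name, pct in cpu_percentages(first, second).items():
        print("%-8s %5.1f%%" % (name, pct))
```

Running `report()` in a loop while a Jaguar job is active, and appending the output to a file on local disk, would show whether a high iowait share lines up with the unresponsive periods. If the node is too starved to run the script at all, that is useful information too.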
Like I said, this isn't an earth-shattering issue -- it just keeps me from taking cpu temperatures, etc., as often as I want to. So, I can wait to upgrade.
This page:
http://www.redhat.com/security/updates/notes/
..doesn't list an update 8 for RHEL 3. Is that the one to wait for?
Again, thanks to all!
07-26-2006 07:19 PM
And if Jaguar is just an application, without binary kernel modules, you can ask RH support for help.
