
Nice process, bad scheduling, big trouble

 
RANNI
Occasional Contributor


Dear all,
I'm having trouble with an nx2 Nastran process running on a DL380 G3 (dual-processor, multithreading enabled) with the 2.4.18-14smp Linux kernel.
I should say that this server was installed about a year ago and this problem has never occurred before.
This is how things go:
The analysis process starts fine with a nice scheduling priority of 10.
It does its work (computing and writing about 100 GB of data), taking almost 100% of one CPU (we don't have a Nastran parallel, i.e. //, license).
BUT
After a while (depending on the size of the computing project) it takes 100% of ALL CPUs and never finishes (I let it run for 12 hours on a computation that should take 2 hours).
So I tried to kill it. kill -15: doesn't work. kill -9: it stays alive. reboot: no effect (the server doesn't reboot at all).

In fact, the process takes 100% of the CPU in system mode, waiting for some I/O to finish (I think?!).
I let the process run until an "ASR Detected by System ROM" event rebooted the system.
We restarted the computation and the process went the same way; every computation goes the same way.
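
One way to check the "stuck in system mode" guess is to read the process state and CPU counters from /proc. The sketch below is only an illustration, not something from the original setup: the helper names `parse_stat` and `read_stat` are made up here, and it assumes a Python interpreter is available on the server. It reads the state letter plus the user-mode and kernel-mode tick counters from /proc/<pid>/stat. If the state is D (uninterruptible sleep), or the state is R while only `stime` keeps growing between two reads, the process is sitting inside the kernel, and signals such as kill -9 are only acted on once it returns to user mode.

```python
def parse_stat(line):
    """Parse one /proc/<pid>/stat line into the fields of interest."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split on the LAST ')' rather than on whitespace.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    fields = tail.split()
    return {
        "pid": int(pid_str),
        "comm": comm,
        "state": fields[0],        # field 3: R, S, D, Z, T, ...
        "utime": int(fields[11]),  # field 14: clock ticks spent in user mode
        "stime": int(fields[12]),  # field 15: clock ticks spent in kernel mode
        "nice": int(fields[16]),   # field 19: current nice value
    }


def read_stat(pid):
    """Read and parse /proc/<pid>/stat for a live process (Linux only)."""
    with open("/proc/%d/stat" % pid) as f:
        return parse_stat(f.read())
```

Calling `read_stat(pid)` twice, a few seconds apart, and comparing `utime` and `stime` shows whether the time is being burned in user code or in the kernel.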

After checking for hardware problems (I didn't find any), I tried to renice the analysis process. At level 5 nothing changed; at level 0 the process stopped by itself after a few seconds.

We started a new computation at nice level 0, and the process stopped by itself. But sometimes a process at nice level 0 gets into trouble again and I have to renice it (-1, -2, -3, etc.).
My question is:
What does this process expect from me? More seriously, I understand that renicing to a lower value gives a process higher scheduling priority, but it seems that whatever the priority, the process freezes, and the only way to stop it is to renice it to a higher priority.
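
For reference, the renice steps described above can be sketched in Python. This is only an illustration of the nice semantics, with hypothetical helper names (`clamp_nice`, `renice`): on Linux the nice value runs from -20 to 19, a lower value means a higher scheduling priority, and going below the current value normally requires root.

```python
import os

# Linux nice range; a LOWER value means a HIGHER scheduling priority.
NICE_MIN, NICE_MAX = -20, 19


def clamp_nice(value):
    """Keep a requested nice value inside the valid range."""
    return max(NICE_MIN, min(NICE_MAX, value))


def renice(pid, value):
    """Set the nice value of `pid` and return the value read back.

    Lowering the nice value (raising priority) usually needs root;
    without it os.setpriority raises PermissionError.
    """
    os.setpriority(os.PRIO_PROCESS, pid, clamp_nice(value))
    return os.getpriority(os.PRIO_PROCESS, pid)
```

For example, `renice(pid, 0)` corresponds to the "level 0" step, and `renice(pid, -1)` to the first of the negative steps.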

The data is written to a local SCSI hard disk with enough free space. No error messages in /var/log/*, dmesg, the iLO log...
We stressed the system (NFS, I/O writes, CPU, memory) for 12 hours and nothing seemed to be wrong.
I tried with multithreading enabled and then disabled, and with the iLO card enabled/disabled.
If you have any ideas...

Thanks a lot
Best Regards

laurent Ranni
xyko_1
Esteemed Contributor

Re: Nice process, bad scheduling, big trouble

Ranni,

I'm afraid you have a software (Nastran) problem. But you don't have a Nastran license, which makes things a lot more serious.

Let's try some things.

Can you run the process with less data to be processed, i.e., a less resource-consuming job?

Can you see whether memory comes under pressure over time while the long-running process works?
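
A rough way to watch memory while the job runs is to sample /proc/meminfo at intervals. The sketch below is an assumption-laden illustration (the helper names `parse_meminfo` and `read_meminfo` are invented for it): it keeps only the "Key: value kB" lines and skips anything else, such as the extra summary lines that older kernels print at the top of the file.

```python
import re

# Matches lines such as "MemFree:   15000 kB"; other lines are ignored.
_KB_LINE = re.compile(r"^(\w+):\s+(\d+)\s+kB", re.MULTILINE)


def parse_meminfo(text):
    """Return a dict mapping /proc/meminfo keys to sizes in kB."""
    return {key: int(value) for key, value in _KB_LINE.findall(text)}


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live /proc/meminfo (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Logging `read_meminfo()["MemFree"]` and `["SwapFree"]` every minute or so during the run would show whether free memory and swap shrink steadily before the process hangs.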

Maybe your software problem (bug) appears only when a very large amount of data has to be processed.

So, as you don't have a license that would let you report a bug, I suggest that you break your long-running process into two or more, if possible.

I know it's not a great help, but it's all I can do for you.

Good luck.
Xyko