
Re: server hang

Frequent Advisor

server hang

Our HP ProLiant DL580 G5 server keeps hanging.
OS: Linux 2.6.18-92.1.18
Please check the logs and help me find the root cause.
Honored Contributor

Re: server hang

Because you've saved the log file in SSH packet dump mode, it's rather difficult to read the text.

But the cause of the hang is pretty obvious: the system is running out of memory (RAM).

Text from the beginning of the log dump:

>Eaudispd invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0

[the rest of the log seems to contain only oom-killer debugging information]

When a Linux system has used up all its RAM and swap and has already shrunk all OS caches to the minimum possible size, the system is truly out of memory. At this point, the kernel starts an "oom-killer" procedure, which is intended to find a process to kill and reclaim some memory that way. Ideally, the oom-killer should find the process that caused the system to run out of memory and kill it, but this is actually very hard to implement reliably.

In practice, the oom-killer may kill processes somewhat randomly. It might kill the SSH daemon, or other processes essential for logging on to the system through the network. In this case, it may be very difficult to do anything other than reboot the system.
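
To get a rough idea in advance of which processes the oom-killer would favour, you can read the per-process "badness" score the kernel exposes in /proc/<pid>/oom_score. The following is only a minimal sketch, assuming Python 3 is available on the box (a system of this age may ship something much older) and that you run it as root so every process is readable. The function name read_oom_scores is my own, not part of any tool.

```python
import os

def read_oom_scores(proc_root="/proc"):
    """Return (score, pid, name) tuples, highest oom_score first."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "stat")) as f:
                stat = f.read()
            # The command name is the parenthesised second field of "stat".
            name = stat[stat.index("(") + 1:stat.rindex(")")]
        except (OSError, ValueError):
            # The process exited while we were reading, or is unreadable.
            continue
        results.append((score, int(entry), name))
    results.sort(reverse=True)
    return results

if __name__ == "__main__":
    # Print the ten processes the kernel currently rates as the best victims.
    for score, pid, name in read_oom_scores()[:10]:
        print("%8d  %6d  %s" % (score, pid, name))
```

If something essential like sshd shows up near the top of that list, that is a hint that a real out-of-memory event could lock you out of the machine.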

After the reboot, you can only examine the logs and/or improve the monitoring on the system, so that next time you will catch the memory shortage *before* it gets so bad that oom-killer starts.
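
As one possible way to catch the shortage early, a small script run from cron could check /proc/meminfo and complain when free memory or free swap gets low. This is a sketch under assumptions: the 10% threshold is an arbitrary example value, the function names are mine, and on a kernel this old there is no MemAvailable field, so MemFree + Buffers + Cached is used as a rough estimate of what can still be handed out.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of numbers (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info

def memory_pressure(info, min_free_fraction=0.10):
    """Return a list of warning strings; an empty list means no alarm."""
    warnings = []
    # Rough estimate of memory that is free or can be reclaimed from caches.
    reclaimable = (info.get("MemFree", 0) + info.get("Buffers", 0)
                   + info.get("Cached", 0))
    total = info.get("MemTotal", 0)
    if total and reclaimable < total * min_free_fraction:
        warnings.append("low RAM: %d kB of %d kB left" % (reclaimable, total))
    swap_total = info.get("SwapTotal", 0)
    swap_free = info.get("SwapFree", 0)
    if swap_total and swap_free < swap_total * min_free_fraction:
        warnings.append("low swap: %d kB of %d kB left" % (swap_free, swap_total))
    return warnings

if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        for message in memory_pressure(parse_meminfo(f.read())):
            print(message)
```

Cron mails any output to the job's owner by default, so printing only when there is a problem gives you a crude alert.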

When a system runs for a long time (days/weeks/months), the memory usage of all its applications should eventually stabilize to some value if the workload of the system remains constant. But if an application has a bug, it might keep allocating more and more memory without any limit at all, because it cannot re-use the memory it already has, or "forgets" that it has the memory. This is a "memory leak".

If you draw a graph of the application's memory usage, a memory leak presents a characteristic "sawtooth" pattern: when an application is started, its memory usage rapidly climbs to some initial value, then keeps steadily growing after that. Once the application is stopped and restarted, the memory usage again returns to the initial value (even if the workload is exactly the same as before the restart), then resumes the slow growth.
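
If you already have memory samples for one process, the "keeps steadily growing" part can be checked numerically instead of by eye. The sketch below fits a straight line (ordinary least squares) to the samples after skipping the initial climb. The sample format of (hours since start, size in kB), the 25% warm-up cut, and the 100 kB/hour threshold are all assumptions chosen for illustration.

```python
def growth_rate(samples):
    """Least-squares slope of (time, value) samples, in value units per time unit."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples have the same timestamp")
    return num / den

def looks_like_leak(samples, warmup=0.25, min_rate=100.0):
    """Skip the start-up climb, then test whether usage still trends upward."""
    skip = int(len(samples) * warmup)
    steady = samples[skip:]
    return growth_rate(steady) > min_rate

if __name__ == "__main__":
    # Made-up data: 200 MB after start-up, then 500 kB of growth per hour.
    leaking = [(h, 200000 + 500 * h) for h in range(48)]
    print(growth_rate(leaking), looks_like_leak(leaking))
```

A healthy process with a constant workload should give a slope close to zero once the warm-up period is excluded.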

"atop" and its non-interactive component "atopsar" are good tools for catching memory leaks. There is also an optional "atopscripts" package, which contains a "findleak" script that can interpret the data collected by atopsar and list the processes whose size has been growing, ordered by the speed of growth.

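If atop cannot be installed right away, the same basic idea (list the processes whose size is growing, fastest first) can be approximated by comparing two snapshots of /proc. To be clear, this is not the findleak script and does not read atop's data. It is a hypothetical stand-in that reads the VmRSS line from /proc/<pid>/status, and the names snapshot and rank_growth are my own.

```python
import os

def snapshot(proc_root="/proc"):
    """Map pid -> (name, resident set size in kB) for every readable process."""
    snap = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        name, rss = None, None
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                for line in f:
                    if line.startswith("Name:"):
                        name = line.split(None, 1)[1].strip()
                    elif line.startswith("VmRSS:"):
                        rss = int(line.split()[1])
        except (OSError, ValueError, IndexError):
            continue  # process vanished or is unreadable
        # Kernel threads have no VmRSS line and are skipped here.
        if name is not None and rss is not None:
            snap[int(entry)] = (name, rss)
    return snap

def rank_growth(before, after, elapsed_seconds):
    """Processes present in both snapshots that grew, fastest first (kB per hour)."""
    rows = []
    for pid, (name, rss_after) in after.items():
        # Require the same name too, as a cheap guard against pid reuse.
        if pid in before and before[pid][0] == name:
            rate = (rss_after - before[pid][1]) * 3600.0 / elapsed_seconds
            if rate > 0:
                rows.append((rate, pid, name))
    rows.sort(reverse=True)
    return rows
```

Take one snapshot, wait an hour or more, take another, and pass both to rank_growth. Two snapshots are easily fooled by normal fluctuation, so treat the result as a list of suspects to watch with atop, not as proof of a leak.
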
Because stopping and restarting the application "resets" the memory leak, restarting the application periodically (each night or weekend, for example) can be used as a work-around. But such a leak is always a bug in the application and should be fixed.

(It was often thought that older versions of Windows required a "maintenance reboot" every now and then. This practice covered up a lot of memory leaks and allowed a culture of poor programming to develop.)

Frequent Advisor

Re: server hang


Thank you so much for your input. This will help us proceed further.