Server Hangs every 3 months

 
Larry UofM
Occasional Visitor

We have a few servers (one of them highly visible) that lock up every 3 months. The entire system becomes unresponsive and we must reboot. I have seen a similar issue in the past with Dell hardware and DRAC cards, where the card did a firmware restart that caused the system to hang (it was fixed with a firmware upgrade / kernel upgrade).

We are using iLO cards in these servers and I am not sure whether they are the culprit. Nothing in the logs shows any sort of problem.

uname -a output

Linux ########## 2.6.5-7.286-bigsmp #1 SMP Thu May 31 10:12:58 UTC 2007 i686 athlon i386 GNU/Linux

Does anyone have any ideas? Both Novell and HP have been unable to come up with anything.

Thanks,
Larry
8 REPLIES
Ivan Ferreira
Honored Contributor

Re: Server Hangs every 3 months

You should enable the magic sysrq key and try to force a memory dump.
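
A minimal sketch of checking and enabling the magic SysRq key, assuming the usual `/proc/sys/kernel/sysrq` control file is present on this kernel and that a Python 3 interpreter is available (a stock SLES 9 install will not have one). The function names are made up for illustration.

```python
from pathlib import Path

# Standard location of the SysRq switch on Linux.
SYSRQ_PATH = Path("/proc/sys/kernel/sysrq")

def sysrq_state(path=SYSRQ_PATH):
    """Return 'disabled', 'enabled', or 'unknown' for the kernel.sysrq setting."""
    try:
        value = int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return "unknown"
    # 0 means off. Any non-zero value is reported as enabled here; on newer
    # kernels a value above 1 is a bitmask that may enable only some functions.
    return "disabled" if value == 0 else "enabled"

def enable_sysrq(path=SYSRQ_PATH):
    """Turn on all SysRq functions until the next reboot (needs root)."""
    Path(path).write_text("1\n")

if __name__ == "__main__":
    print("SysRq is", sysrq_state())
```

To keep the setting across reboots, the usual approach is a `kernel.sysrq = 1` line in `/etc/sysctl.conf`.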

When I had a similar problem, I configured a remote syslog server. The hung system could not write to disk but could still send messages over the network, and that gave more information for troubleshooting the problem.

Install collectl and enable performance logging. That can give you an idea of what was going on in the system at the time of the hang.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
skt_skt
Honored Contributor

Re: Server Hangs every 3 months

As a test, you can simulate a crash with the SysRq facility. Enable sysrq and follow this article:
http://kbase.redhat.com/faq/FAQ_80_5559.shtm

The 'c' character will simulate a crash. Time how long the core takes to be created so that you have a guideline if there is another crash. Do not manually reboot until AFTER the core is created.
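
A rough sketch of the test crash, assuming SysRq is already enabled, a dump mechanism is already configured, and `/proc/sysrq-trigger` exists on this kernel. The function name and the `confirm` guard are illustrative only. With `confirm=True` on a real system this WILL crash the machine, so run it only on a box you can afford to lose.

```python
import time
from pathlib import Path

# Writing a character here has the same effect as pressing that SysRq key.
TRIGGER = Path("/proc/sysrq-trigger")

def trigger_test_crash(confirm=False, trigger=TRIGGER):
    """Write 'c' to the SysRq trigger file; without confirm=True, only describe it."""
    if not confirm:
        return "dry run: would write 'c' to {}".format(trigger)
    # Print the start time so it can be compared with the dump's timestamp later.
    print("crash requested at", time.strftime("%Y-%m-%d %H:%M:%S"), flush=True)
    Path(trigger).write_text("c")
    # On a real system the kernel crashes during the write and this is never reached.
    return "written"

if __name__ == "__main__":
    print(trigger_test_crash())
```

Note the printed start time from the console, then compare it with the time the dump finishes to get the guideline mentioned above.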
Don Vanco - Linux Ninja
Regular Advisor

Re: Server Hangs every 3 months

Aside from the answers already provided, have you looked over the sar data? Could it be something as simple as a filesystem filling up with temp data and killing the host?
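
To rule out the full-filesystem theory, something like the following could be run from cron and its output logged. This is only a sketch: the mount list and the 90% threshold are assumptions, and it needs Python 3, which would have to be installed separately on a system of this age.

```python
import shutil

def fullest_filesystems(mount_points, threshold=0.90):
    """Return (mount_point, used_fraction) for mounts at or above threshold."""
    flagged = []
    for mount in mount_points:
        try:
            usage = shutil.disk_usage(mount)
        except OSError:
            continue  # path missing or not readable
        if usage.total == 0:
            continue  # pseudo filesystems can report a size of zero
        used = usage.used / usage.total
        if used >= threshold:
            flagged.append((mount, used))
    # Fullest first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    # Hypothetical list; substitute the mounts that matter on your servers.
    for mount, used in fullest_filesystems(["/", "/tmp", "/var"]):
        print("{}: {:.0%} used".format(mount, used))
```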
Larry UofM
Occasional Visitor

Re: Server Hangs every 3 months

Here is the stack I got yesterday... oddly enough I am getting crashes a lot more often now.

Unable to handle kernel NULL pointer dereference at virtual address 00000174
printing eip:
c015ff0f
*pde = 21d58001
Oops: 0000 [#1]
SMP
CPU: 0
EIP: 0060:[] Tainted: PF U
EFLAGS: 00010286 (2.6.5-7.287.3-bigsmp SLES9_SP3_BRANCH-20071002073136)
EIP is at blk_queue_bounce+0xf/0x310
eax: 00000000 ebx: f510dc68 ecx: 00000000 edx: d1bb5b44
esi: 00000000 edi: 00000000 ebp: 00000008 esp: d1bb5af0
ds: 007b es: 007b ss: 0068
Process novell-zislnxd (pid: 22349, threadinfo=d1bb4000 task=f2769980)
Stack: 00000001 00000001 cdfd2c50 ca852720 00000003 dabbee48 d1bb5b44 00000000
f510dc68 00000000 00000000 00000008 c026e21b f510de04 00000046 00000000
00000000 00000000 00000008 00000008 faa570a0 f510dc68 f510dc68 f510dc04
Call Trace:
[] __make_request+0x4b/0x530
[] MpcPathWeightForAdaptive+0x0/0x130 [emcpmpc]
[] PowerPlatformBottomDispatch+0x3b3/0x470 [emcp]
[] MpcDispatchGuts+0xb0/0xc0 [emcpmpc]
[] PowerTopDispatch+0x10b/0x320 [emcp]
[] allocPio+0x12/0x20 [emcp]
[] emcp_native_mrf+0x56/0x90 [emcp]
[] generic_make_request+0x11d/0x200
[] mempool_alloc+0x74/0x130
[] autoremove_wake_function+0x0/0x40
[] submit_bio+0x63/0x120
[] autoremove_wake_function+0x0/0x40
[] bio_alloc+0xe2/0x1d0
[] submit_bh+0x17d/0x220
[] block_read_full_page+0x367/0x370
[] blkdev_get_block+0x0/0x80
[] add_to_page_cache+0x57/0x180
[] read_pages+0x130/0x1b0
[] __alloc_pages+0xb4/0x430
[] blockable_page_cache_readahead+0x12b/0x1a0
[] page_cache_readahead+0x243/0x300
[] do_generic_mapping_read+0x41c/0x7d0
[] flush_tlb_page+0x59/0xe0
[] __generic_file_aio_read+0x1e2/0x220
[] file_read_actor+0x0/0xf0
[] generic_file_read+0x98/0xc0
[] autoremove_wake_function+0x0/0x40
[] sys_wait4+0x195/0x5c0
[] __pollwait+0x0/0x120
[] vfs_read+0xc6/0x160
[] sys_read+0x91/0xf0
[] sysenter_past_esp+0x52/0x71

Code: f6 80 74 01 00 00 01 0f 85 e4 01 00 00 8b 54 24 1c a1 c8 8f
done waiting: 3 cpus not responding
Dumping to block device (104,1) on CPU 0 ...
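
Two things can be read from the oops above. First, assuming the `Code:` line starts at EIP (which fits the numbers), the leading bytes `f6 80 74 01 00 00 01` decode to `testb $0x1,0x174(%eax)`; with eax = 00000000 that reads address 0x174, matching the reported fault address. Second, the call trace passes through loadable modules before reaching the block layer. The sketch below lists which modules appear in a pasted trace. The regular expression is an assumption based on the line format shown here, not a general oops parser.

```python
import re
from collections import OrderedDict

# Matches trace lines such as "[] allocPio+0x12/0x20 [emcp]".
# The address inside the first brackets may be empty, as in the paste above.
FRAME_RE = re.compile(
    r"^\[<?[0-9a-fA-F]*>?\]\s+(?P<func>[\w.]+)\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+"
    r"(?:\s+\[(?P<module>[\w-]+)\])?\s*$"
)

def modules_in_trace(oops_text):
    """Return {module: [functions]} for frames that belong to a loadable module."""
    found = OrderedDict()
    for line in oops_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match and match.group("module"):
            found.setdefault(match.group("module"), []).append(match.group("func"))
    return found

# A few frames copied from the trace above.
SAMPLE = """\
[] __make_request+0x4b/0x530
[] MpcPathWeightForAdaptive+0x0/0x130 [emcpmpc]
[] PowerPlatformBottomDispatch+0x3b3/0x470 [emcp]
[] generic_make_request+0x11d/0x200
"""

if __name__ == "__main__":
    for module, funcs in modules_in_trace(SAMPLE).items():
        print(module, funcs)
```

Run against the full trace, this would show which frames come from the `emcp` and `emcpmpc` modules, which is useful when deciding which vendor to send the oops to.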
Avijit Patra
Occasional Advisor

Re: Server Hangs every 3 months

Does the server have any scheduled job at that interval?
MarkSeger
Frequent Advisor

Re: Server Hangs every 3 months

I agree with the previous reply about running collectl - probably because I wrote it. 9-)

The important thing to remember if you run collectl or even sar is to use a fairly high monitoring frequency; I know most sar users monitor only once every 10 minutes. By default collectl monitors once every 10 seconds, but even at that frequency it typically uses <0.1% of the CPU.

Once you've collected a pile of data with it you can then play it back and look at a variety of data in a variety of formats showing most of the types of things sar shows and then some. The key things I'd look for are system resources that are going up in consumption as well as what was going on at the time of the 'lock up', assuming you know the approximate time.

One resource people often miss (probably because there aren't any other utilities I know of that will log their usage) is 'slabs'. Collectl will show you the amount of memory allocated to slabs when you show memory usage, but if you think you are seeing an issue, you can also look at changes over time to individual slabs. Since slab monitoring (and process monitoring too, for that matter) is more expensive than monitoring the other types of data, those subsystems are sampled once a minute in order to stay within that <0.1% overhead window.
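
As an illustration of what "changes over time to individual slabs" might look like in practice, here is a small sketch that flags slabs whose size never shrinks and grows substantially across a series of samples. The input format (a dict of slab name to a time-ordered list of byte counts) is hypothetical; it is not collectl's output format, and you would have to build it from whatever you export. The 50% growth threshold is also an arbitrary assumption.

```python
def growing_slabs(samples, min_growth=0.5):
    """Flag slabs that only ever grow.

    samples: {slab_name: [bytes_at_t0, bytes_at_t1, ...]} in time order.
    Returns (slab_name, growth_fraction) pairs, largest growth first, for slabs
    that never shrank between samples and grew by at least min_growth overall.
    """
    suspects = []
    for name, series in samples.items():
        if len(series) < 2 or series[0] <= 0:
            continue  # not enough data to judge a trend
        never_shrinks = all(later >= earlier
                            for earlier, later in zip(series, series[1:]))
        growth = (series[-1] - series[0]) / series[0]
        if never_shrinks and growth >= min_growth:
            suspects.append((name, growth))
    return sorted(suspects, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    # Made-up numbers purely to show the call.
    example = {
        "dentry_cache": [100, 150, 220],
        "inode_cache": [100, 90, 110],
        "buffer_head": [100, 100, 101],
    }
    for name, growth in growing_slabs(example):
        print("{} grew by {:.0%}".format(name, growth))
```

A slab that shows up here over days or weeks is a candidate for a slow leak, which would fit a hang that takes months to appear.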

Just keep in mind that by default collectl will write its data to a log in /var/log/collectl, creating a new log every day and retaining the 7 previous ones. If you need to keep more, you can modify the number in /etc/collectl.conf.

Check it out at http://collectl.sourceforge.net/ and enjoy.

-mark
Ivan Ferreira
Honored Contributor

Re: Server Hangs every 3 months

>>>> Process novell-zislnxd (pid: 22349, threadinfo=d1bb4000 task=f2769980)

Contact Novell support. It looks like a problem in Novell ZENworks Linux Management.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Larry UofM
Occasional Visitor

Re: Server Hangs every 3 months

According to Novell, this is a problem with EMC PowerPath, not ZLM. Oddly enough, this also happens on servers not running PowerPath. I have yet to get a core dump from those servers (they crash less often than the others).